xcjthu committed on
Commit a27fb6b · 1 Parent(s): 139dc15

update readme for better presentation

Files changed (1)
  1. README.md +82 -90
README.md CHANGED
@@ -20,11 +20,15 @@ library_name: transformers
 </p>
 
 ## What's New
- - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
 - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
 
- ## MiniCPM4 and MiniCPM4.1 Series
- MiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs) designed explicitly for end-side devices, which achieves this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
 - [MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B): The latest version of MiniCPM4, with 8B parameters, supporting fusion thinking. (**<-- you are here**)
 - [MiniCPM4.1-8B-GPTQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-GPTQ): MiniCPM4.1-8B in GPTQ format.
 - [MiniCPM4.1-8B-AutoAWQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-AutoAWQ): MiniCPM4.1-8B in AutoAWQ format.
@@ -48,89 +52,29 @@ MiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs)
 - [**MiniCPM4-MCP**](https://huggingface.co/openbmb/MiniCPM4-MCP): Integrates MCP tools to autonomously satisfy user requirements
 </details>
 
- ## Introduction
- MiniCPM4 and MiniCPM4.1 are extremely efficient edge-side large model that has undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
-
- - 🏗️ **Efficient Model Architecture:**
- - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
-
- - 🧠 **Efficient Learning Algorithms:**
- - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
- - BitCPM -- Ultimate Ternary Quantization: Compresses model parameter bit-width to 3 values, achieving 90% extreme model bit-width reduction
- - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
-
- - 📚 **High-Quality Training Data:**
- - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing high-quality Chinese and English pre-training dataset [UltraFinweb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
- - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
-
- - ⚡ **Efficient Inference System:**
- - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
- - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
-
- ## Usage
-
- ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
-
- We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for the inference of MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
-
- You can install CPM.cu by running the following command:
-
- ```bash
- git clone https://github.com/OpenBMB/cpm.cu.git --recursive
- cd cpm.cu
- python3 setup.py install
- ```
-
- MiniCPM4.1 natively supports context lengths of up to 65,536(64k) tokens. To reproduce the long-text acceleration effect in the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in the `config.json` file as the following to enable LongRoPE.
- ```json
- {
- ...,
- "rope_scaling": {
- "rope_type": "longrope",
- "long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "original_max_position_embeddings": 65536
- }
- }
- ```
-
- After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace)
- ```bash
- python3 tests/test_generate.py
- ```
-
- You can run the following command to infer with EAGLE3 speculative decoding algorithm.
-
- ```bash
- python3 -m cpmcu.cli \
- --model-path $BASE_MODEL_PATH \
- --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \
- --prompt-text "Write an article about Artificial Intelligence." \
- --use-eagle3 true
- ```
-
- For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
-
- ### Hybird Reasoning Mode
-
- MiniCPM4.1 supports hybrid reasoning mode, which can be used in both deep reasoning mode and non-reasoning mode. To enable hybrid reasoning mode. User can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable hybrid reasoning mode, and set `enable_thinking=False` to enable non-reasoning mode. Similarly, user can directly add `/no_think` at the end of the query to enable non-reasoning mode. If not add any special token or add `/think` at the end of the query, the model will enable reasoning mode.
-
- ```python
- # Enable reasoning mode
- prompt_text = tokenizer.apply_chat_template(
- messages,
- tokenize=False,
- add_generation_prompt=True,
- enable_thinking=True
- )
- # Enable non-reasoning mode
- prompt_text = tokenizer.apply_chat_template(
- messages,
- tokenize=False,
- add_generation_prompt=True,
- enable_thinking=False
- )
- ```
 
 ### Inference with Transformers
 ```python
@@ -226,6 +170,8 @@ You can apply the LongRoPE factor modification by modifying the model files. Spe
 
 ### Inference with [SGLang](https://github.com/sgl-project/sglang)
 
 #### Speculative Decoding
 
 For accelerated inference with speculative decoding, follow these steps:
@@ -246,7 +192,7 @@ The EAGLE3 adaptation PR has been submitted. For now, use our repository for ins
 ```bash
 git clone https://github.com/LDLINGLINGLING/sglang.git
 cd sglang
- pip install -e .
 ```
 
 ##### 3. Launch SGLang Server with Speculative Decoding
@@ -337,6 +283,7 @@ print(response.choices[0].message.content)
 ```
 
 ### Inference with [vLLM](https://github.com/vllm-project/vllm)
 
 #### Speculative Decoding
 
@@ -344,7 +291,7 @@ For accelerated inference with speculative decoding using vLLM, follow these ste
 
 ##### 1. Download MiniCPM4.1 Draft Model
 
- First, download the MiniCPM4.1 draft model:
 
 ```bash
 cd /your_path
@@ -450,7 +397,7 @@ Also, you can start the inference server by running the following command:
 > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
 
 ```bash
- vllm serve openbmb/MiniCPM4.1-8B
 ```
 
 Then you can use the chat interface by running the following code:
@@ -474,24 +421,69 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
- ## Evaluation Results
- On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
-
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
-
- MiniCPM4.1 achieves 3x decoding speed improvement in reasoning.
-
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/minicpm4.1_speed.png?raw=true)
-
- #### Comprehensive Evaluation
- MiniCPM4.1 launches end-side versions with 8B parameter scale, both achieving best-in-class performance in their respective categories.
-
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark4.1.png?raw=true)
-
- #### Long Text Evaluation
- MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
-
- ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
 
 ## Statement
 - As a language model, MiniCPM generates content by learning from a vast amount of text.
 
@@ -20,11 +20,15 @@
 </p>
 
 ## What's New
+ - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
 - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
 
+ ## Highlights
+ MiniCPM4.1 offers the following key features:
+ ✅ Strong Reasoning Capability: Surpasses similar-sized models on 15 tasks!
+ ✅ Fast Generation: 3x decoding speedup for reasoning
+ ✅ Efficient Architecture: Trainable sparse attention, frequency-ranked speculative decoding
+
 - [MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B): The latest version of MiniCPM4, with 8B parameters, supporting fusion thinking. (**<-- you are here**)
 - [MiniCPM4.1-8B-GPTQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-GPTQ): MiniCPM4.1-8B in GPTQ format.
 - [MiniCPM4.1-8B-AutoAWQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-AutoAWQ): MiniCPM4.1-8B in AutoAWQ format.
 
@@ -48,89 +52,29 @@
 - [**MiniCPM4-MCP**](https://huggingface.co/openbmb/MiniCPM4-MCP): Integrates MCP tools to autonomously satisfy user requirements
 </details>
 
+ ## Evaluation Results
+
+ ### Performance Evaluation
+ MiniCPM4.1 launches an end-side version at the 8B parameter scale, achieving best-in-class performance in its category.
+
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark4.1.png?raw=true)
+
+ ### Efficiency Evaluation
+ MiniCPM4.1 adopts sparse attention and speculative decoding to improve inference efficiency. On an RTX 4090, MiniCPM4.1 achieves a 3x decoding speed improvement in reasoning.
+
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/minicpm4.1_speed.png?raw=true)
+
+ #### Examples
+ <div align="center">
+ <a href="https://www.youtube.com/watch?v=VouXjLHKDUY"><img src="https://img.youtube.com/vi/VouXjLHKDUY/0.jpg" width="70%"></a>
+ </div>
 
+ ## Usage
+ MiniCPM4.1 can be used with the following frameworks: Hugging Face Transformers, SGLang, vLLM, and CPM.cu. For the best inference speed, we highly recommend CPM.cu.
 
 ### Inference with Transformers
 ```python
 
@@ -226,6 +170,8 @@
 
 ### Inference with [SGLang](https://github.com/sgl-project/sglang)
 
+ You can run inference with SGLang in either standard mode or speculative decoding mode.
+
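+ For standard mode, a minimal launch of an OpenAI-compatible SGLang server looks like this (the port is an assumption; see the SGLang documentation for the full option list):
+
+ ```bash
+ python -m sglang.launch_server --model-path openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000
+ ```
+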
 #### Speculative Decoding
 
 For accelerated inference with speculative decoding, follow these steps:
 
@@ -246,7 +192,7 @@
 ```bash
 git clone https://github.com/LDLINGLINGLING/sglang.git
 cd sglang
+ pip install -e "python[all]"
 ```
 
 ##### 3. Launch SGLang Server with Speculative Decoding
 
@@ -337,6 +283,7 @@
 ```
 
 ### Inference with [vLLM](https://github.com/vllm-project/vllm)
+ You can run inference with vLLM in either standard mode or speculative decoding mode.
 
 #### Speculative Decoding
 
@@ -344,7 +291,7 @@
 
 ##### 1. Download MiniCPM4.1 Draft Model
 
+ First, download the MiniCPM4.1 draft model and change the `architectures` field in its `config.json` to `["LlamaForCausalLM"]`.
 
 ```bash
 cd /your_path
 
@@ -450,7 +397,7 @@
 > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
 
 ```bash
+ vllm serve openbmb/MiniCPM4.1-8B --trust-remote-code
 ```
 
 Then you can use the chat interface by running the following code:
@@ -474,24 +421,69 @@
 print(response.choices[0].message.content)
 ```
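
+ For reference, a minimal client call that applies the note above looks like this (the port and API key are assumptions; vLLM's OpenAI-compatible server listens on port 8000 by default):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="openbmb/MiniCPM4.1-8B",
+     messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
+     extra_body={"add_special_tokens": True},  # ensure the BOS token is added
+ )
+ print(response.choices[0].message.content)
+ ```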
 
+
+ ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
+
+ We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for the inference of MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
+
+ You can install CPM.cu by running the following command:
+
+ ```bash
+ git clone https://github.com/OpenBMB/cpm.cu.git --recursive
+ cd cpm.cu
+ python3 setup.py install
+ ```
+
+ MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect reported in the paper, we recommend using the validated LongRoPE factors. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
+ ```json
+ {
+ ...,
+ "rope_scaling": {
+ "rope_type": "longrope",
+ "long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
+ "short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
+ "original_max_position_embeddings": 65536
+ }
+ }
+ ```
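+
+ If you prefer to script this change, a minimal sketch follows (the paths are assumptions: `config.json` lives in your local checkpoint directory, and `rope_scaling_patch.json` holds the `rope_scaling` object copied verbatim from the snippet above):
+
+ ```python
+ import json
+
+ config_path = "MiniCPM4.1-8B/config.json"   # local checkpoint directory (assumption)
+ patch_path = "rope_scaling_patch.json"      # the rope_scaling object shown above
+
+ # Load the existing config and replace its rope_scaling field.
+ with open(config_path) as f:
+     config = json.load(f)
+ with open(patch_path) as f:
+     config["rope_scaling"] = json.load(f)
+
+ # Write the patched config back in place.
+ with open(config_path, "w") as f:
+     json.dump(config, f, indent=2)
+ ```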
+
+ After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
+ ```bash
+ python3 tests/test_generate.py
+ ```
+
+ You can run the following command to infer with the EAGLE3 speculative decoding algorithm:
+
+ ```bash
+ python3 -m cpmcu.cli \
+ --model-path $BASE_MODEL_PATH \
+ --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \
+ --prompt-text "Write an article about Artificial Intelligence." \
+ --use-eagle3 true
+ ```
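+
+ In the command above, `$BASE_MODEL_PATH` and `$EAGLE3_DRAFT_MODEL_PATH` are placeholders. For example (the draft path is hypothetical; point it at the EAGLE3 draft weights you downloaded):
+
+ ```bash
+ export BASE_MODEL_PATH=openbmb/MiniCPM4.1-8B
+ export EAGLE3_DRAFT_MODEL_PATH=/path/to/eagle3-draft-model
+ ```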
+
+ For more details about CPM.cu, please refer to the [CPM.cu repository](https://github.com/OpenBMB/cpm.cu).
+
+ ### Hybrid Reasoning Mode
+
+ MiniCPM4.1 supports a hybrid reasoning mode and can operate in both deep reasoning mode and non-reasoning mode. Set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, or set `enable_thinking=False` to enable non-reasoning mode. Similarly, you can append `/no_think` to the end of the query to enable non-reasoning mode; if no special token is appended, or `/think` is appended, the model runs in reasoning mode.
+
+ ```python
+ # Enable reasoning mode
+ prompt_text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True
+ )
+ # Enable non-reasoning mode
+ prompt_text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False
+ )
+ ```
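+
+ As described above, the in-query switch works without changing any template arguments. A minimal sketch (the user message is illustrative):
+
+ ```python
+ # Append /no_think to the query to disable reasoning for this turn
+ messages = [{"role": "user", "content": "Write an article about Artificial Intelligence. /no_think"}]
+ prompt_text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ ```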
 
  ## Statement
  - As a language model, MiniCPM generates content by learning from a vast amount of text.