Improve model card: Add paper/project links and `code-generation`, `datasets`, `metrics` tags

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +348 -123
README.md CHANGED
@@ -1,123 +1,348 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - zh
5
- - en
6
- pipeline_tag: text-generation
7
- library_name: transformers
8
- ---
9
- <div align="center">
10
- <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
- </div>
12
-
13
- <p align="center">
14
- <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
- <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
16
- </p>
17
- <p align="center">
18
- 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
- </p>
20
-
21
- ## What's New
22
- - [2025.06.06] The **MiniCPM4** series is released! It delivers large efficiency gains while maintaining strong performance for its scale, with over 5x generation acceleration on typical end-side chips. You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).🔥🔥🔥
23
-
24
- ## MiniCPM4 Series
25
- The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. The efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
26
- - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens. (**<-- you are here**)
27
- - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
28
- - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
29
- - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to further accelerate MiniCPM4-8B.
30
- - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
31
- - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format, accelerating speculative inference for MiniCPM4-8B.
32
- - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
33
- - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
34
- - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
35
- - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
36
-
37
- ## Introduction
38
- MiniCPM4 is a highly efficient edge-side large language model, optimized across four dimensions: model architecture, learning algorithms, training data, and inference systems.
39
-
40
- - 🏗️ **Efficient Model Architecture:**
41
- - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
42
-
43
- - 🧠 **Efficient Learning Algorithms:**
44
- - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
45
- - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to ternary (3-valued) weights, achieving a 90% reduction in bit-width
46
- - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
47
-
48
- - 📚 **High-Quality Training Data:**
49
- - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
50
- - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
51
-
52
- - **Efficient Inference System:**
53
- - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
54
- - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
55
-
56
- ## Usage
57
-
58
- ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
59
-
60
- We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates sparse attention, speculative sampling, and quantization techniques to fully exploit the efficiency advantages of MiniCPM4.
61
-
62
- You can install CPM.cu by running the following command:
63
-
64
- ```bash
65
- git clone https://github.com/OpenBMB/cpm.cu.git --recursive
66
- cd cpm.cu
67
- python3 setup.py install
68
- ```
69
-
70
- MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text acceleration effect reported in the paper, we recommend using the validated LongRoPE factors. To enable LongRoPE, change the `rope_scaling` field in the `config.json` file as follows.
71
- ```json
72
- {
73
- ...,
74
- "rope_scaling": {
75
- "rope_type": "longrope",
76
- "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
77
- "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
78
- "original_max_position_embeddings": 32768
79
- }
80
- }
81
- ```
82
-
83
- After this modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
84
- ```bash
85
- python3 tests/test_generate.py
86
- ```
87
-
88
- For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
89
-
90
- ## Evaluation Results
91
- On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
92
-
93
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
94
-
95
- #### Comprehensive Evaluation
96
- MiniCPM4 launches end-side versions with 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.
97
-
98
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)
99
-
100
- #### Long Text Evaluation
101
- MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
102
-
103
- ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
104
-
105
- ## Statement
106
- - As a language model, MiniCPM generates content by learning from a vast amount of text.
107
- - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
108
- - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
109
- - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
110
-
111
- ## LICENSE
112
- - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
113
-
114
- ## Citation
115
- - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
116
-
117
- ```bibtex
118
- @article{minicpm4,
119
- title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
120
- author={MiniCPM Team},
121
- year={2025}
122
- }
123
- ```
1
+ ---
2
+ language:
3
+ - zh
4
+ - en
5
+ library_name: transformers
6
+ license: apache-2.0
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - code-generation
10
+ datasets:
11
+ - openbmb/Ultra-FineWeb
12
+ metrics:
13
+ - accuracy
14
+ ---
15
+
16
+ <div align="center">
17
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
18
+ </div>
19
+
20
+ <p align="center">
21
+ <a href="https://huggingface.co/papers/2506.07900" target="_blank">Paper</a> |
22
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
23
+ <a href="https://huggingface.co/collections/openbmb/minicpm4-6841ab29d180257e940baa9b" target="_blank">Project Page</a> |
24
+ <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
25
+ </p>
26
+ <p align="center">
27
+ 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
28
+ </p>
29
+
30
+ ## What's New
31
+ - [2025.06.06] The **MiniCPM4** series is released! It delivers large efficiency gains while maintaining strong performance for its scale, with over 5x generation acceleration on typical end-side chips. You can find the paper [here](https://huggingface.co/papers/2506.07900).🔥🔥🔥
32
+
33
+ ## MiniCPM4 Series
34
+ The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. The efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
35
+ - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens. (**<-- you are here**)
36
+ - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
37
+ - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
38
+ - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to further accelerate MiniCPM4-8B.
39
+ - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
40
+ - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format, accelerating speculative inference for MiniCPM4-8B.
41
+ - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
42
+ - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
43
+ - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
44
+ - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
45
+
46
+ ## Introduction
47
+ MiniCPM4 is a highly efficient edge-side large language model, optimized across four dimensions: model architecture, learning algorithms, training data, and inference systems.
48
+
49
+ - 🏗️ **Efficient Model Architecture:**
50
+ - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
51
+
52
+ - 🧠 **Efficient Learning Algorithms:**
53
+ - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
54
+ - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to ternary (3-valued) weights, achieving a 90% reduction in bit-width
55
+ - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
56
+
57
+ - 📚 **High-Quality Training Data:**
58
+ - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
59
+ - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
60
+
61
+ - ⚡ **Efficient Inference System:**
62
+ - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
63
+ - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
64
+
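The "90% reduction in bit-width" figure in the BitCPM bullet above can be checked with a little arithmetic. The sketch below is an illustration only: it assumes the unquantized baseline stores each weight in 16 bits (for example BF16) and that a ternary weight is packed at its information-theoretic minimum of log2(3) bits. Neither assumption is stated in the model card, and the function name is hypothetical.

```python
import math


def ternary_bit_width_reduction(baseline_bits: float = 16.0) -> float:
    """Fractional bit-width reduction when each weight takes one of 3 values.

    baseline_bits is an assumption (16-bit weights), not a documented value.
    """
    # A 3-valued symbol carries log2(3) bits of information (about 1.585).
    ternary_bits = math.log2(3)
    return 1.0 - ternary_bits / baseline_bits


reduction = ternary_bit_width_reduction()
print(f"{reduction:.1%}")  # prints 90.1%
```

Under these assumptions the reduction works out to roughly 90%, which matches the figure quoted for BitCPM; real storage formats may pack ternary weights less tightly.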
65
+ ## Usage
66
+
67
+ ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
68
+
69
+ We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates sparse attention, speculative sampling, and quantization techniques to fully exploit the efficiency advantages of MiniCPM4.
70
+
71
+ You can install CPM.cu by running the following command:
72
+
73
+ ```bash
74
+ git clone https://github.com/OpenBMB/cpm.cu.git --recursive
75
+ cd cpm.cu
76
+ python3 setup.py install
77
+ ```
78
+
79
+ You can then run the following commands to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
80
+ ```bash
81
+ python3 tests/long_prompt_gen.py # generate prompt.txt
82
+ python3 tests/test_generate.py --prompt-file prompt.txt
83
+ ```
84
+
85
+ For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
86
+
87
+ ### HuggingFace
88
+
89
+ ```python
90
+ from transformers import AutoModelForCausalLM, AutoTokenizer
91
+ import torch
92
+ torch.manual_seed(0)
93
+
94
+ path = 'openbmb/MiniCPM4-8B'
95
+ device = "cuda"
96
+ tokenizer = AutoTokenizer.from_pretrained(path)
97
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
98
+
99
+ # User can directly use the chat interface
100
+ # responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
101
+ # print(responds)
102
+
103
+ # User can also use the generate interface
104
+ messages = [
105
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
106
+ ]
107
+ prompt_text = tokenizer.apply_chat_template(
108
+ messages,
109
+ tokenize=False,
110
+ add_generation_prompt=True,
111
+ )
112
+ model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
113
+
114
+ model_outputs = model.generate(
115
+ **model_inputs,
116
+ max_new_tokens=1024,
117
+ top_p=0.7,
118
+ temperature=0.7
119
+ )
120
+ output_token_ids = [
121
+ model_outputs[i][len(model_inputs["input_ids"][i]):] for i in range(len(model_inputs["input_ids"]))
122
+ ]
123
+
124
+ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
125
+ print(responses)
126
+ ```
127
+
128
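+ This model supports InfLLM v2, a sparse attention mechanism that enables efficient long-sequence inference. To enable it, first install the dependency library [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl).
129
+
130
+ Install it by running the following commands: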
131
+
132
+ ```bash
133
+ git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
134
+ cd infllmv2_cuda_impl
135
+ git submodule update --init --recursive
136
+ pip install -e . # or python setup.py install
137
+ ```
138
+
139
+ To enable InfLLM v2, add a `sparse_config` field to the `config.json` file:
140
+
141
+ ```json
142
+ {
143
+ ...,
144
+ "sparse_config": {
145
+ "kernel_size": 32,
146
+ "kernel_stride": 16,
147
+ "init_blocks": 1,
148
+ "block_size": 64,
149
+ "window_size": 2048,
150
+ "topk": 64,
151
+ "use_nope": false,
152
+ "dense_len": 8192
153
+ }
154
+ }
155
+ ```
156
+
157
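+ These parameters control the behavior of InfLLM v2:
158
+
159
+ * `kernel_size` (default: 32): The size of the semantic kernels.
160
+ * `kernel_stride` (default: 16): The stride between adjacent semantic kernels.
161
+ * `init_blocks` (default: 1): The number of initial blocks that every query token attends to, which ensures attention to the beginning of the sequence.
162
+ * `block_size` (default: 64): The block size of the key-value blocks.
163
+ * `window_size` (default: 2048): The size of the local sliding window.
164
+ * `topk` (default: 64): Each token computes attention with only the top-k most relevant key-value blocks.
165
+ * `use_nope` (default: false): Whether to use the NOPE technique in block selection to improve performance.
166
+ * `dense_len` (default: 8192): Sparse attention offers limited benefit for short sequences, so the model automatically switches to standard attention when the token length is below this threshold. Set it to `-1` to always use sparse attention.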
167
+
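As a rough illustration of the parameters above, the sketch below writes the documented `sparse_config` defaults into a `config.json` file and shows how `dense_len` gates sparse attention. The helper names (`add_sparse_config`, `uses_sparse_attention`, `max_attended_fraction`) are hypothetical and not part of any library, and the attended-token estimate assumes that the top-k blocks, the initial blocks, and the sliding window do not overlap, which gives an upper bound rather than the exact behavior of InfLLM v2.

```python
import json
import tempfile
from pathlib import Path

# Default values copied from the sparse_config example in this model card.
SPARSE_CONFIG = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}


def add_sparse_config(config_path, sparse_config=SPARSE_CONFIG):
    """Add (or overwrite) the sparse_config field of a config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["sparse_config"] = dict(sparse_config)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


def uses_sparse_attention(num_tokens, dense_len):
    """Sequences below dense_len stay dense; -1 always selects sparse attention."""
    return dense_len == -1 or num_tokens >= dense_len


def max_attended_fraction(num_tokens, cfg=SPARSE_CONFIG):
    """Upper bound on the share of tokens one query attends to (assumed model)."""
    attended = (cfg["topk"] + cfg["init_blocks"]) * cfg["block_size"] + cfg["window_size"]
    return min(1.0, attended / num_tokens)


# Demo on a stand-in config file; point this at the real config.json instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"example_key": "example_value"}), encoding="utf-8")
    patched = add_sparse_config(demo_path)

print(sorted(patched))
print(uses_sparse_attention(4096, SPARSE_CONFIG["dense_len"]))
print(max_attended_fraction(131072) < 0.05)
```

With the default values and a 131,072-token sequence, this estimate stays below 5% of the tokens, in line with the figure given for InfLLM v2 in the Introduction.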
168
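+ MiniCPM4 natively supports context lengths of up to 32,768 tokens. If the total conversation length (input + output) far exceeds this limit, we recommend extending the context with RoPE scaling. We have verified that, with adjusted LongRoPE factors, the model stably supports context lengths of up to 131,072 tokens.
169
+
170
+ To do so, adjust the `rope_scaling` field in the `config.json` file.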
171
+
172
+ ```json
173
+ {
174
+ ...,
175
+ "rope_scaling": {
176
+ "rope_type": "longrope",
177
+ "long_factor": [
178
+ 0.9977997200264581,
179
+ 1.014658295992452,
180
+ 1.0349680404997148,
181
+ 1.059429246056193,
182
+ 1.0888815016813513,
183
+ 1.1243301355211495,
184
+ 1.166977103606075,
185
+ 1.2182568066927284,
186
+ 1.2798772354275727,
187
+ 1.3538666751582975,
188
+ 1.4426259039919596,
189
+ 1.5489853358570191,
190
+ 1.6762658237220625,
191
+ 1.8283407612492941,
192
+ 2.0096956085876183,
193
+ 2.225478927469756,
194
+ 2.481536379650452,
195
+ 2.784415934557119,
196
+ 3.1413289096347365,
197
+ 3.560047844772632,
198
+ 4.048719380066383,
199
+ 4.615569542115128,
200
+ 5.2684819496549835,
201
+ 6.014438591970396,
202
+ 6.858830049237097,
203
+ 7.804668263503327,
204
+ 8.851768731513417,
205
+ 9.99600492938444,
206
+ 11.228766118181639,
207
+ 12.536757560834843,
208
+ 13.902257701387796,
209
+ 15.303885189125953,
210
+ 16.717837610115794,
211
+ 18.119465097853947,
212
+ 19.484965238406907,
213
+ 20.792956681060105,
214
+ 22.02571786985731,
215
+ 23.16995406772833,
216
+ 24.217054535738416,
217
+ 25.16289275000465,
218
+ 26.007284207271347,
219
+ 26.753240849586767,
220
+ 27.40615325712662,
221
+ 27.973003419175363,
222
+ 28.461674954469114,
223
+ 28.880393889607006,
224
+ 29.237306864684626,
225
+ 29.540186419591297,
226
+ 29.79624387177199,
227
+ 30.01202719065413,
228
+ 30.193382037992453,
229
+ 30.34545697551969,
230
+ 30.47273746338473,
231
+ 30.579096895249787,
232
+ 30.66785612408345,
233
+ 30.741845563814174,
234
+ 30.80346599254902,
235
+ 30.85474569563567,
236
+ 30.897392663720595,
237
+ 30.932841297560394,
238
+ 30.962293553185553,
239
+ 30.986754758742034,
240
+ 31.007064503249293,
241
+ 31.02392307921529
242
+ ],
243
+ "short_factor": [
244
+ 0.9977997200264581,
245
+ 1.014658295992452,
246
+ 1.0349680404997148,
247
+ 1.059429246056193,
248
+ 1.0888815016813513,
249
+ 1.1243301355211495,
250
+ 1.166977103606075,
251
+ 1.2182568066927284,
252
+ 1.2798772354275727,
253
+ 1.3538666751582975,
254
+ 1.4426259039919596,
255
+ 1.5489853358570191,
256
+ 1.6762658237220625,
257
+ 1.8283407612492941,
258
+ 2.0096956085876183,
259
+ 2.225478927469756,
260
+ 2.481536379650452,
261
+ 2.784415934557119,
262
+ 3.1413289096347365,
263
+ 3.560047844772632,
264
+ 4.048719380066383,
265
+ 4.615569542115128,
266
+ 5.2684819496549835,
267
+ 6.014438591970396,
268
+ 6.858830049237097,
269
+ 7.804668263503327,
270
+ 8.851768731513417,
271
+ 9.99600492938444,
272
+ 11.228766118181639,
273
+ 12.536757560834843,
274
+ 13.902257701387796,
275
+ 15.303885189125953,
276
+ 16.717837610115794,
277
+ 18.119465097853947,
278
+ 19.484965238406907,
279
+ 20.792956681060105,
280
+ 22.02571786985731,
281
+ 23.16995406772833,
282
+ 24.217054535738416,
283
+ 25.16289275000465,
284
+ 26.007284207271347,
285
+ 26.753240849586767,
286
+ 27.40615325712662,
287
+ 27.973003419175363,
288
+ 28.461674954469114,
289
+ 28.880393889607006,
290
+ 29.237306864684626,
291
+ 29.540186419591297,
292
+ 29.79624387177199,
293
+ 30.01202719065413,
294
+ 30.193382037992453,
295
+ 30.34545697551969,
296
+ 30.47273746338473,
297
+ 30.579096895249787,
298
+ 30.66785612408345,
299
+ 30.741845563814174,
300
+ 30.80346599254902,
301
+ 30.85474569563567,
302
+ 30.897392663720595,
303
+ 30.932841297560394,
304
+ 30.962293553185553,
305
+ 30.986754758742034,
306
+ 31.007064503249293,
307
+ 31.02392307921529
308
+ ]
309
+ },
310
+ "original_max_position_embeddings": 32768
311
+ }
312
+ ```
313
+
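Editing the long factor lists by hand is error-prone, so a small script can apply the change instead. The sketch below is one possible approach and not an official tool: `set_rope_scaling` is a hypothetical helper that copies a config dictionary, sets its `rope_scaling` field, and checks that `long_factor` and `short_factor` have the same length. The three-element factor lists are only the first three values from the example above, truncated for brevity; a real config needs the full lists.

```python
import json


def set_rope_scaling(config: dict, rope_scaling: dict) -> dict:
    """Return a copy of config with its rope_scaling field replaced."""
    long_factor = rope_scaling.get("long_factor", [])
    short_factor = rope_scaling.get("short_factor", [])
    # Sanity check (an assumption of this sketch): both lists cover the same dimensions.
    if len(long_factor) != len(short_factor):
        raise ValueError(
            f"long_factor has {len(long_factor)} entries, "
            f"short_factor has {len(short_factor)}"
        )
    updated = dict(config)
    updated["rope_scaling"] = dict(rope_scaling)
    return updated


# Truncated factor lists for illustration only; use the full lists in practice.
factors = [0.9977997200264581, 1.014658295992452, 1.0349680404997148]
base_config = {"example_key": "example_value"}  # stand-in for the loaded config.json
updated = set_rope_scaling(
    base_config,
    {"rope_type": "longrope", "long_factor": factors, "short_factor": factors},
)
print(json.dumps(updated, indent=2))
```

Returning a copy leaves the loaded config untouched, so the original values remain available if the edited file needs to be reverted.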
314
+ ## Evaluation Results
315
+ On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
316
+
317
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
318
+
319
+ #### Comprehensive Evaluation
320
+ MiniCPM4 launches end-side versions with 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.
321
+
322
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)
323
+
324
+ #### Long Text Evaluation
325
+ MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
326
+
327
+ ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
328
+
329
+ ## Statement
330
+ - As a language model, MiniCPM generates content by learning from a vast amount of text.
331
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
332
+ - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
333
+ - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
334
+
335
+ ## LICENSE
336
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
337
+
338
+ ## Citation
339
+ - Please cite our [paper](https://huggingface.co/papers/2506.07900) if you find our work valuable.
340
+
341
+ ```bibtex
342
+ @article{minicpm4,
343
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
344
+ author={MiniCPM Team},
345
+ year={2025},
346
+ journal={arXiv preprint arXiv:2506.07900}
347
+ }
348
+ ```