BigDong committed
Commit 41c8915 · 1 Parent(s): f482c7a

update README.md

Files changed (1)
  1. README.md +164 -4
README.md CHANGED
@@ -12,14 +12,14 @@ library_name: transformers
12
 
13
  <p align="center">
14
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
- <a href="TODO" target="_blank">Technical Report</a>
16
  </p>
17
  <p align="center">
18
  👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
  </p>
20
 
21
  ## What's New
22
- - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find technical report on [arXiv](TODO).🔥🔥🔥
23
 
24
  ## MiniCPM4 Series
25
  The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices, achieving this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
@@ -52,9 +52,165 @@ MiniCPM 4 is an extremely efficient edge-side large model that has undergone eff
52
  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
53
 
54
  ## Usage
55
  ### Inference with Transformers
56
 
57
  ### Inference with [vLLM](https://github.com/vllm-project/vllm)
58
 
59
  ## Evaluation Results
60
  On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
@@ -84,8 +240,12 @@ MiniCPM4 is pre-trained on 32K long texts and achieves length extension through
84
 
85
  ## Citation
86
 
87
- - Please cite our [paper](TODO) if you find our work valuable.
88
 
89
  ```bibtex
90
- TODO
91
  ```
 
12
 
13
  <p align="center">
14
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
+ <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
16
  </p>
17
  <p align="center">
18
  👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
  </p>
20
 
21
  ## What's New
22
+ - [2025.06.06] **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf). 🔥🔥🔥
23
 
24
  ## MiniCPM4 Series
25
  The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices, achieving this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
 
52
  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
53
 
54
  ## Usage
55
+
56
+ ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
57
+
58
+ We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4.
59
+
60
+ You can install CPM.cu by running the following command:
61
+
62
+ ```bash
63
+ git clone https://github.com/OpenBMB/cpm.cu.git --recursive
64
+ cd cpm.cu
65
+ python3 setup.py install
66
+ ```
67
+
68
+ MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text acceleration effect reported in the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
69
+ ```json
70
+ {
71
+ ...,
72
+ "rope_scaling": {
73
+ "rope_type": "longrope",
74
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
75
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
76
+ "original_max_position_embeddings": 32768
77
+ }
78
+ }
79
+ ```
80
+
81
+ After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
82
+ ```bash
83
+ python3 tests/test_generate.py
84
+ ```
85
+
86
+ For more details about CPM.cu, please refer to the [CPM.cu repository](https://github.com/OpenBMB/cpm.cu).
87
+
88
  ### Inference with Transformers
89
+ ```python
90
+ from transformers import AutoModelForCausalLM, AutoTokenizer
91
+ import torch
92
+ torch.manual_seed(0)
93
+
94
+ path = 'openbmb/MiniCPM4-8B'
95
+ device = "cuda"
96
+ tokenizer = AutoTokenizer.from_pretrained(path)
97
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
98
+
99
+ # User can directly use the chat interface
100
+ # responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
101
+ # print(responds)
102
+
103
+ # User can also use the generate interface
104
+ messages = [
105
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
106
+ ]
107
+ model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
108
+
109
+ model_outputs = model.generate(
110
+ model_inputs,
111
+ max_new_tokens=1024,
112
+ top_p=0.7,
113
+ temperature=0.7
114
+ )
115
+ output_token_ids = [
116
+ model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
117
+ ]
118
+
119
+ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
120
+ print(responses)
121
+ ```
122
+
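+ If you want to stream the output as it is generated, the `TextStreamer` utility from transformers can be plugged into the same `generate` call. The sketch below is illustrative and continues from the snippet above (it reuses `tokenizer`, `model`, and `model_inputs`); the sampling settings simply mirror that example.
+
+ ```python
+ from transformers import TextStreamer
+
+ # Prints decoded text to stdout as tokens are generated, skipping the prompt
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(
+     model_inputs,
+     max_new_tokens=1024,
+     top_p=0.7,
+     temperature=0.7,
+     streamer=streamer,
+ )
+ ```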
123
+ MiniCPM4-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
124
+
125
+ You can install it by running the following command:
126
+ ```bash
127
+ git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
128
+ cd infllmv2_cuda_impl
129
+ git submodule update --init --recursive
130
+ pip install -e . # or python setup.py install
131
+ ```
132
+
133
+ To enable InfLLM v2, you need to add the `sparse_config` field in `config.json`:
134
+ ```json
135
+ {
136
+ ...,
137
+ "sparse_config": {
138
+ "kernel_size": 32,
139
+ "kernel_stride": 16,
140
+ "init_blocks": 1,
141
+ "block_size": 64,
142
+ "window_size": 2048,
143
+ "topk": 64,
144
+ "use_nope": false,
145
+ "dense_len": 8192
146
+ }
147
+ }
148
+ ```
149
+
150
+ These parameters control the behavior of InfLLM v2:
151
+ * `kernel_size` (default: 32): Size of the semantic kernels.
152
+ * `kernel_stride` (default: 16): Stride between adjacent kernels.
153
+ * `init_blocks` (default: 1): Number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
154
+ * `block_size` (default: 64): Block size for key-value blocks.
155
+ * `window_size` (default: 2048): Size of the local sliding window.
156
+ * `topk` (default: 64): Each token computes attention with only the top-k most relevant key-value blocks.
157
+ * `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
158
+ * `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model uses dense attention for sequences below `dense_len` tokens and switches to sparse attention for longer sequences. Set this to `-1` to always use sparse attention regardless of sequence length.
159
+
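+ If you prefer not to edit `config.json` by hand, the same change can be scripted. The sketch below is illustrative rather than official tooling; it assumes a local copy of the checkpoint (the `./MiniCPM4-8B` path is hypothetical) and simply writes the `sparse_config` block shown above into the file before loading the model.
+
+ ```python
+ import json
+ from pathlib import Path
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Hypothetical local copy of the checkpoint (e.g. from huggingface_hub.snapshot_download)
+ ckpt_dir = Path("./MiniCPM4-8B")
+ config_path = ckpt_dir / "config.json"
+
+ # Add the InfLLM v2 sparse attention settings shown above
+ config = json.loads(config_path.read_text())
+ config["sparse_config"] = {
+     "kernel_size": 32,
+     "kernel_stride": 16,
+     "init_blocks": 1,
+     "block_size": 64,
+     "window_size": 2048,
+     "topk": 64,
+     "use_nope": False,
+     "dense_len": 8192,
+ }
+ config_path.write_text(json.dumps(config, indent=2))
+
+ # Load the patched checkpoint as usual
+ tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
+ model = AutoModelForCausalLM.from_pretrained(
+     ckpt_dir, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
+ )
+ ```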
160
+ MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.
161
+
162
+ You can apply the LongRoPE factor modification by editing the model files. Specifically, in the `config.json` file, adjust the `rope_scaling` field as follows:
163
+ ```json
164
+ {
165
+ ...,
166
+ "rope_scaling": {
167
+ "rope_type": "longrope",
168
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
169
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
170
+ "original_max_position_embeddings": 32768
171
+ }
172
+ }
173
+ ```
174
+
175
+ ### Inference with [SGLang](https://github.com/sgl-project/sglang)
176
+
177
+ For now, you need to install our forked version of SGLang.
178
+ ```bash
179
+ git clone -b openbmb https://github.com/OpenBMB/sglang.git
180
+ cd sglang
181
+
182
+ pip install --upgrade pip
183
+ pip install -e "python[all]"
184
+ ```
185
+
186
+ You can start the inference server by running the following command:
187
+ ```bash
188
+ python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
189
+ ```
190
+
191
+ Then you can use the chat interface by running the following code:
192
+ ```python
193
+ import openai
194
+
195
+ client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
196
+
197
+ response = client.chat.completions.create(
198
+ model="openbmb/MiniCPM4-8B",
199
+ messages=[
200
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
201
+ ],
202
+ temperature=0.7,
203
+ max_tokens=1024,
204
+ )
205
+
206
+ print(response.choices[0].message.content)
207
+ ```
208
 
209
  ### Inference with [vLLM](https://github.com/vllm-project/vllm)
210
+ For now, you need to install the latest version of vLLM (a recent pip release is assumed to already include MiniCPM4 support):
211
+
212
+ ```bash
+ pip install -U vllm
213
+ ```
214
 
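+ Once vLLM is installed, the example below shows offline inference with vLLM's standard `LLM`/`SamplingParams` API. It is a minimal sketch rather than an official recipe; the sampling settings simply mirror the other examples in this card.
+
+ ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_name = "openbmb/MiniCPM4-8B"
+
+ # Build the prompt with the model's chat template
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_name, trust_remote_code=True, max_model_len=32768)
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.7, max_tokens=1024)
+
+ outputs = llm.generate([prompt], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ vLLM can also expose an OpenAI-compatible server (for example, `vllm serve openbmb/MiniCPM4-8B --trust-remote-code`), which can then be queried in the same way as the SGLang server above.
+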
215
  ## Evaluation Results
216
  On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
 
240
 
241
  ## Citation
242
 
243
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
244
 
245
  ```bibtex
246
+ @article{minicpm4,
247
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
248
+ author={MiniCPM Team},
249
+ year={2025}
250
+ }
251
  ```