Improve model card: correct GitHub link, add project page, and fix HTML

#1 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +1742 -254
README.md CHANGED
@@ -1,10 +1,11 @@
1
  ---
2
- pipeline_tag: image-text-to-text
3
  datasets:
4
  - openbmb/RLAIF-V-Dataset
5
- library_name: transformers
6
  language:
7
  - multilingual
8
  tags:
9
  - minicpm-v
10
  - vision
@@ -12,13 +13,11 @@ tags:
12
  - multi-image
13
  - video
14
  - custom_code
15
- license: apache-2.0
16
  ---
17
 
18
  <h1>A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone</h1>
19
 
20
- [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Technical Report](https://huggingface.co/papers/2509.18154) | [Demo](http://101.126.42.235:30910/) </a>
21
-
22
 
23
 
24
  ## MiniCPM-V 4.5
@@ -124,7 +123,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
124
  <td><b>73.6</td>
125
  <td>2.63h</td>
126
  <td>32G</td>
127
- </tr>
128
  <tr>
129
  <td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
130
  <td>8.7B</td>
@@ -138,6 +137,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
138
 
139
  Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.
140
 
 
141
  ### Examples
142
 
143
  <div align="center">
@@ -150,6 +150,14 @@ Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference.
150
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
151
  </div>
152
 
153
  We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without editing.
154
 
155
  <div align="center">
@@ -162,230 +170,1273 @@ We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/Mini
162
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
163
  </div>
164
 
165
- ## Framework Support Matrix
166
- <table>
167
- <thead>
168
- <tr>
169
- <th>Category</th>
170
- <th>Framework</th>
171
- <th>Cookbook Link</th>
172
- <th>Upstream PR</th>
173
- <th>Supported since(branch)</th>
174
- <th>Supported since(release)</th>
175
- </tr>
176
- </thead>
177
- <tbody>
178
- <tr>
179
- <td rowspan="2">Edge(On-device)</td>
180
- <td>Llama.cpp</td>
181
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_5_llamacpp.md">Llama.cpp Doc</a></td>
182
- <td><a href="https://github.com/ggml-org/llama.cpp/pull/15575">#15575</a>(2025-08-26)</td>
183
- <td>master(2025-08-26)</td>
184
- <td><a href="https://github.com/ggml-org/llama.cpp/releases/tag/b6282">b6282</a></td>
185
- </tr>
186
- <tr>
187
- <td>Ollama</td>
188
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_5_ollama.md">Ollama Doc</a></td>
189
- <td><a href="https://github.com/ollama/ollama/pull/12078">#12078</a>(2025-08-26)</td>
190
- <td>Merging</td>
191
- <td>Waiting for official release</td>
192
- </tr>
193
- <tr>
194
- <td rowspan="2">Serving(Cloud)</td>
195
- <td>vLLM</td>
196
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_5_vllm.md">vLLM Doc</a></td>
197
- <td><a href="https://github.com/vllm-project/vllm/pull/23586">#23586</a>(2025-08-26)</td>
198
- <td>main(2025-08-27)</td>
199
- <td><a href="https://github.com/vllm-project/vllm/releases/tag/v0.10.2">v0.10.2</td>
200
- </tr>
201
- <tr>
202
- <td>SGLang</td>
203
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_5_sglang.md">SGLang Doc</a></td>
204
- <td><a href="https://github.com/sgl-project/sglang/pull/9610">#9610</a>(2025-08-26)</td>
205
- <td>Merging</td>
206
- <td>Waiting for official release</td>
207
- </tr>
208
- <tr>
209
- <td>Finetuning</td>
210
- <td>LLaMA-Factory</td>
211
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md">LLaMA-Factory Doc</a></td>
212
- <td><a href="https://github.com/hiyouga/LLaMA-Factory/pull/9022">#9022</a>(2025-08-26)</td>
213
- <td>main(2025-08-26)</td>
214
- <td>Waiting for official release</td>
215
- </tr>
216
- <tr>
217
- <td rowspan="3">Quantization</td>
218
- <td>GGUF</td>
219
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_5_gguf_quantize.md">GGUF Doc</a></td>
220
- <td>—</td>
221
- <td>—</td>
222
- <td>—</td>
223
- </tr>
224
- <tr>
225
- <td>BNB</td>
226
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_5_bnb_quantize.md">BNB Doc</a></td>
227
- <td>—</td>
228
- <td>—</td>
229
- <td>—</td>
230
- </tr>
231
- <tr>
232
- <td>AWQ</td>
233
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-v4_5_awq_quantize.md">AWQ Doc</a></td>
234
- <td>—</td>
235
- <td>—</td>
236
- <td>—</td>
237
- </tr>
238
- <tr>
239
- <td>Demos</td>
240
- <td>Gradio Demo</td>
241
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/gradio/README.md">Gradio Demo Doc</a></td>
242
- <td>—</td>
243
- <td>—</td>
244
- <td>—</td>
245
- </tr>
246
- </tbody>
247
- </table>
248
-
249
- > Note: If you'd like us to prioritize support for another open-source framework, please let us know via this [short form](https://docs.google.com/forms/d/e/1FAIpQLSdyTUrOPBgWqPexs3ORrg47ZcZ1r4vFQaA4ve2iA7L9sMfMWw/viewform).
250
 
251
- ## Usage
252
 
253
- If you wish to enable thinking mode, provide the argument `enable_thinking=True` to the chat function.
 
254
 
255
- #### Chat with Image
256
- ```python
257
- import torch
258
- from PIL import Image
259
- from transformers import AutoModel, AutoTokenizer
260
 
261
- torch.manual_seed(100)
262
 
263
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
264
- attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
265
- model = model.eval().cuda()
266
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
267
 
268
- image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
 
269
 
270
- enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
271
- stream=True # If `stream=True`, the answer is string
272
 
273
- # First round chat
274
- question = "What is the landform in the picture?"
275
- msgs = [{'role': 'user', 'content': [image, question]}]
276
 
277
- answer = model.chat(
278
- msgs=msgs,
279
- tokenizer=tokenizer,
280
- enable_thinking=enable_thinking,
281
- stream=True
282
- )
283
 
284
- generated_text = ""
285
- for new_text in answer:
286
- generated_text += new_text
287
- print(new_text, flush=True, end='')
288
 
289
- # Second round chat, pass history context of multi-turn conversation
290
- msgs.append({"role": "assistant", "content": [generated_text]})
291
- msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
292
 
293
- answer = model.chat(
294
- msgs=msgs,
295
- tokenizer=tokenizer,
296
- stream=True
297
- )
298
 
299
- generated_text = ""
300
- for new_text in answer:
301
- generated_text += new_text
302
- print(new_text, flush=True, end='')
303
- ```
304
 
305
- You will get the following output:
 
306
 
307
- ```shell
308
- # round1
309
- The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.
310
 
311
- This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.
312
 
313
- # round2
314
- When traveling to a karst landscape like this, here are some important tips:
315
 
316
- 1. Wear comfortable shoes: The terrain can be uneven and hilly.
317
- 2. Bring water and snacks for energy during hikes or boat rides.
318
- 3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
319
- 4. Respect local customs and nature regulations by not littering or disturbing wildlife.
320
 
321
- By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
322
- ```
323
 
324
 
325
- #### Chat with Video
326
 
327
- ```python
328
- ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
329
- # To achieve this, you need to organize your video data into two corresponding sequences:
330
- # frames: List[Image]
331
- # temporal_ids: List[List[Int]].
332
 
333
- import torch
334
- from PIL import Image
335
- from transformers import AutoModel, AutoTokenizer
336
- from decord import VideoReader, cpu # pip install decord
337
- from scipy.spatial import cKDTree
338
- import numpy as np
339
- import math
340
 
341
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
342
- attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
343
- model = model.eval().cuda()
344
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
345
 
346
- MAX_NUM_FRAMES=180 # Indicates the maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
347
- MAX_NUM_PACKING=3 # indicates the maximum packing number of video frames. valid range: 1-6
348
- TIME_SCALE = 0.1
349
 
350
- def map_to_nearest_scale(values, scale):
351
- tree = cKDTree(np.asarray(scale)[:, None])
352
- _, indices = tree.query(np.asarray(values)[:, None])
353
- return np.asarray(scale)[indices]
354
 
 
355
 
356
- def group_array(arr, size):
357
- return [arr[i:i+size] for i in range(0, len(arr), size)]
358
 
359
- def encode_video(video_path, choose_fps=3, force_packing=None):
360
- def uniform_sample(l, n):
361
- gap = len(l) / n
362
- idxs = [int(i * gap + gap / 2) for i in range(n)]
363
- return [l[i] for i in idxs]
364
- vr = VideoReader(video_path, ctx=cpu(0))
365
- fps = vr.get_avg_fps()
366
- video_duration = len(vr) / fps
367
-
368
- if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
369
- packing_nums = 1
370
- choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
371
-
372
- else:
373
- packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
374
- if packing_nums <= MAX_NUM_PACKING:
375
- choose_frames = round(video_duration * choose_fps)
376
- else:
377
- choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
378
- packing_nums = MAX_NUM_PACKING
379
 
380
- frame_idx = [i for i in range(0, len(vr))]
381
- frame_idx = np.array(uniform_sample(frame_idx, choose_frames))
382
 
383
- if force_packing:
384
- packing_nums = min(force_packing, MAX_NUM_PACKING)
385
-
386
- print(video_path, ' duration:', video_duration)
387
- print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')
388
-
389
  frames = vr.get_batch(frame_idx).asnumpy()
390
 
391
  frame_idx_ts = frame_idx / fps
@@ -422,107 +1473,544 @@ answer = model.chat(
422
  )
423
  print(answer)
424
  ```
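To make the frame-budget arithmetic behind `MAX_NUM_FRAMES` and `MAX_NUM_PACKING` concrete, here is a small sketch mirroring the selection logic in `encode_video` above; the 400-second clip at 3 fps is an assumed example, not a value from this card:

```python
# Frame-budget arithmetic for MiniCPM-V 4.5 video packing, mirroring the
# branch in encode_video above. The example clip length is an assumption.
import math

MAX_NUM_FRAMES = 180   # max frames kept after packing
MAX_NUM_PACKING = 3    # max frames packed into one temporal group (valid range 1-6)

print(MAX_NUM_FRAMES * MAX_NUM_PACKING)  # 540: hard ceiling on sampled frames

choose_fps, video_duration = 3, 400      # assumed example: a 400 s video at 3 fps
packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)  # ceil(1200/180) = 7
if packing_nums <= MAX_NUM_PACKING:
    choose_frames = round(video_duration * choose_fps)
else:
    # Requested budget exceeds the ceiling: cap both quantities.
    choose_frames = MAX_NUM_FRAMES * MAX_NUM_PACKING
    packing_nums = MAX_NUM_PACKING

print(packing_nums, choose_frames)       # 3 540
```

So any clip longer than `MAX_NUM_FRAMES * MAX_NUM_PACKING / choose_fps` seconds is sampled down to at most 540 frames, grouped 3 per 64-token slot.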
425
 
426
- #### Chat with multiple images
427
- <details>
428
- <summary> Click to show Python code running MiniCPM-V 4.5 with multiple images input. </summary>
429
-
430
  ```python
431
  import torch
432
  from PIL import Image
433
  from transformers import AutoModel, AutoTokenizer
434
 
435
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
436
- attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
437
  model = model.eval().cuda()
438
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
439
 
440
- image1 = Image.open('image1.jpg').convert('RGB')
441
- image2 = Image.open('image2.jpg').convert('RGB')
442
- question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
443
 
444
- msgs = [{'role': 'user', 'content': [image1, image2, question]}]
 
445
 
446
- answer = model.chat(
447
  msgs=msgs,
448
  tokenizer=tokenizer
449
  )
450
- print(answer)
451
  ```
 
452
  </details>
453
 
454
 
455
- #### In-context few-shot learning
456
  <details>
457
- <summary> Click to view Python code running MiniCPM-V 4.5 with few-shot input. </summary>
458
 
459
  ```python
 
460
  import torch
461
  from PIL import Image
462
  from transformers import AutoModel, AutoTokenizer
463
 
464
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
465
- attn_implementation='sdpa', torch_dtype=torch.bfloat16)
466
- model = model.eval().cuda()
467
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
468
 
469
- question = "production date"
470
- image1 = Image.open('example1.jpg').convert('RGB')
471
- answer1 = "2023.08.04"
472
- image2 = Image.open('example2.jpg').convert('RGB')
473
- answer2 = "2007.04.24"
474
- image_test = Image.open('test.jpg').convert('RGB')
475
 
476
- msgs = [
477
- {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
478
- {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
479
- {'role': 'user', 'content': [image_test, question]}
480
- ]
481
 
482
- answer = model.chat(
 
483
  msgs=msgs,
484
- tokenizer=tokenizer
 
 
485
  )
486
  print(answer)
487
  ```
488
  </details>
489
 
490
 
491
- ## License
492
- #### Model License
493
  * The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
 
494
  * To help us better understand and support our users, we would deeply appreciate it if you could consider optionally filling out a brief registration ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).
495
 
496
- #### Statement
497
- * As an LMM, MiniCPM-V 4.5 generates contents by learning a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-V 4.5 does not represent the views and positions of the model developers
498
- * We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
499
 
500
- ## Key Techniques and Other Multimodal Projects
501
 
502
- 👏 Welcome to explore key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:
503
 
504
- [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
505
 
506
- ## Citation
507
 
508
- If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
509
 
510
- ```bib
511
- @misc{yu2025minicpmv45cookingefficient,
512
- title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
513
- author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
514
- year={2025},
515
- eprint={2509.18154},
516
- archivePrefix={arXiv},
517
- primaryClass={cs.LG},
518
- url={https://arxiv.org/abs/2509.18154},
519
- }
520
 
521
  @article{yao2024minicpm,
522
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
523
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
524
- journal={Nat Commun 16, 5509 (2025)},
525
- year={2025}
526
  }
527
-
528
  ```
 
1
  ---
 
2
  datasets:
3
  - openbmb/RLAIF-V-Dataset
 
4
  language:
5
  - multilingual
6
+ library_name: transformers
7
+ license: apache-2.0
8
+ pipeline_tag: image-text-to-text
9
  tags:
10
  - minicpm-v
11
  - vision
 
13
  - multi-image
14
  - video
15
  - custom_code
 
16
  ---
17
 
18
  <h1>A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone</h1>
19
 
20
+ [GitHub](https://github.com/OpenBMB/MiniCPM-V) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Project Page](https://minicpm-o.readthedocs.io/en/latest/index.html) | [Technical Report](https://huggingface.co/papers/2509.18154) | [Demo](http://101.126.42.235:30910/)
 
21
 
22
 
23
  ## MiniCPM-V 4.5
 
123
  <td><b>73.6</td>
124
  <td>2.63h</td>
125
  <td>32G</td>
126
+ </tr>
127
  <tr>
128
  <td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
129
  <td>8.7B</td>
 
137
 
138
  Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.
139
 
140
+
141
  ### Examples
142
 
143
  <div align="center">
 
150
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
151
  </div>
152
 
153
+ <details>
154
+ <summary>Click to view more cases.</summary>
155
+ <div style="display: flex; flex-direction: column; align-items: center;">
156
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/zh_extra.jpeg" alt="zh_extra" style="margin-bottom: 5px;">
157
+ </div>
158
+
159
+ </details>
160
+
161
  We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without editing.
162
 
163
  <div align="center">
 
170
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
171
  </div>
172
 
173
+ ## MiniCPM-o 2.6
174
 
175
+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
176
 
177
+ - 🔥 **Leading Visual Capability.**
178
+ MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
179
 
180
+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
181
 
182
+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
183
 
184
+ - 💪 **Strong OCR Capability and Others.**
185
+ Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
186
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
 
187
 
188
+ - 🚀 **Superior Efficiency.**
189
+ In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.
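The token-density figures above can be sanity-checked with simple arithmetic; the following is an illustrative sketch, and the 14x14-patch ViT-style encoder used for comparison is an assumption, not a specific model named in this card:

```python
# Back-of-the-envelope check of the token-density claim. The 14x14-patch
# comparison encoder is an assumption for illustration, not a model from the card.
pixels = 1344 * 1344                      # ~1.8M pixels, the stated max image size
minicpm_tokens = 640                      # visual tokens MiniCPM-o 2.6 emits

print(pixels // minicpm_tokens)           # 2822 pixels encoded per visual token

patch_tokens = (1344 // 14) ** 2          # a plain 14x14-patch ViT: 9216 tokens
print(1 - minicpm_tokens / patch_tokens)  # ~0.93, i.e. well over 75% fewer tokens
```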
190
 
191
+ - 💫 **Easy Usage.**
192
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
193
 
194
+ **Model Architecture.**
 
 
195
 
196
+ - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in a fully **end-to-end** fashion, with only the CE loss, to fully exploit rich multimodal knowledge.
197
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides the parallel omni-modality streams into sequential information within small periodic time slices.
198
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
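The time-division multiplexing idea in point (2) can be pictured with a toy sketch; this is a conceptual illustration under made-up stream contents and slice length, not the model's actual implementation:

```python
# Toy illustration of time-division multiplexing (TDM): parallel per-modality
# token streams are chopped into small periodic slices and laid out sequentially
# for a single backbone. Stream contents and slice length are made-up examples.
def tdm_interleave(streams, slice_len):
    out = []
    longest = max(len(s) for s in streams.values())
    for start in range(0, longest, slice_len):
        for name, tokens in streams.items():
            out.extend((name, t) for t in tokens[start:start + slice_len])
    return out

mixed = tdm_interleave(
    {"video": ["v0", "v1", "v2", "v3"], "audio": ["a0", "a1", "a2", "a3"]},
    slice_len=2,
)
print(mixed)
# [('video','v0'), ('video','v1'), ('audio','a0'), ('audio','a1'),
#  ('video','v2'), ('video','v3'), ('audio','a2'), ('audio','a3')]
```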
199
 
200
+ <div align="center">
201
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm-o-26-framework-v2.png" width="80%">
202
+ </div>
 
203
204
 
205
+ ### Evaluation <!-- omit in toc -->
206
 
207
+ <div align="center">
208
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar.jpg" width="80%">
209
+ </div>
 
 
210
 
211
+ <details>
212
+ <summary>Click to view visual understanding results.</summary>
213
 
214
+ **Image Understanding**
 
 
215
 
216
+ <div align="center">
217
+ <table style="margin: 0px auto;">
218
+ <thead>
219
+ <tr>
220
+ <th align="left">Model</th>
221
+ <th>Size</th>
222
+ <th>Token Density<sup>+</sup></th>
223
+ <th>OpenCompass</th>
224
+ <th>OCRBench</th>
225
+ <th>MathVista mini</th>
226
+ <th>ChartQA</th>
227
+ <th>MMVet</th>
228
+ <th>MMStar</th>
229
+ <th>MME</th>
230
+ <th>MMB1.1 test</th>
231
+ <th>AI2D</th>
232
+ <th>MMMU val</th>
233
+ <th>HallusionBench</th>
234
+ <th>TextVQA val</th>
235
+ <th>DocVQA test</th>
236
+ <th>MathVerse mini</th>
237
+ <th>MathVision</th>
238
+ <th>MMHal Score</th>
239
+ </tr>
240
+ </thead>
241
+ <tbody align="center">
242
+ <tr>
243
+ <td colspan="19" align="left"><strong>Proprietary</strong></td>
244
+ </tr>
245
+ <tr>
246
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
247
+ <td>-</td>
248
+ <td>1088</td>
249
+ <td><u>69.9</u></td>
250
+ <td>736</td>
251
+ <td>61.3</td>
252
+ <td>85.7</td>
253
+ <td><strong>69.1</strong></td>
254
+ <td>63.9</td>
255
+ <td>2328.7</td>
256
+ <td>82.2</td>
257
+ <td>84.6</td>
258
+ <td><strong>69.2</strong></td>
259
+ <td><strong>55.0</strong></td>
260
+ <td>-</td>
261
+ <td>92.8</td>
262
+ <td><strong>50.2</strong></td>
263
+ <td><strong>30.4</strong></td>
264
+ <td><u>3.6</u></td>
265
+ </tr>
266
+ <tr>
267
+ <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
268
+ <td>-</td>
269
+ <td>750</td>
270
+ <td>67.9</td>
271
+ <td>788</td>
272
+ <td>61.6</td>
273
+ <td><strong>90.8</strong></td>
274
+ <td>66.0</td>
275
+ <td>62.2</td>
276
+ <td>1920.0</td>
277
+ <td>78.5</td>
278
+ <td>80.2</td>
279
+ <td><u>65.9</u></td>
280
+ <td>49.9</td>
281
+ <td>-</td>
+ <td><strong>95.2</strong></td>
+ <td>-</td>
+ <td>-</td>
+ <td>3.4</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
+ <td>-</td>
+ <td>-</td>
+ <td>64.4</td>
+ <td>754</td>
+ <td>57.7</td>
+ <td>81.3</td>
+ <td>64.0</td>
+ <td>59.1</td>
+ <td>2110.6</td>
+ <td>73.9</td>
+ <td>79.1</td>
+ <td>60.6</td>
+ <td>45.6</td>
+ <td>73.5</td>
+ <td>86.5</td>
+ <td>-</td>
+ <td>19.2</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
+ <td>-</td>
+ <td>1088</td>
+ <td>64.1</td>
+ <td>785</td>
+ <td>52.4</td>
+ <td>-</td>
+ <td>66.9</td>
+ <td>54.8</td>
+ <td>2003.4</td>
+ <td>76.0</td>
+ <td>77.8</td>
+ <td>60.0</td>
+ <td>46.1</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>3.3</td>
+ </tr>
+ <tr>
+ <td colspan="19" align="left"><strong>Open Source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
+ <td>34B</td>
+ <td><u>1820</u></td>
+ <td>58.3</td>
+ <td>591</td>
+ <td>50.3</td>
+ <td>75.6</td>
+ <td>53.2</td>
+ <td>54.2</td>
+ <td>2049.9</td>
+ <td>77.8</td>
+ <td>79.5</td>
+ <td>50.4</td>
+ <td>41.6</td>
+ <td>76.7</td>
+ <td>75.5</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
+ <td>13B</td>
+ <td>784</td>
+ <td>59.1</td>
+ <td>776</td>
+ <td>51.1</td>
+ <td>-</td>
+ <td>58.0</td>
+ <td>54.8</td>
+ <td>2018.8</td>
+ <td>67.9</td>
+ <td>71.2</td>
+ <td>46.9</td>
+ <td>45.0</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Pixtral-12B</td>
+ <td>12B</td>
+ <td>256</td>
+ <td>61.0</td>
+ <td>685</td>
+ <td>56.9</td>
+ <td>81.8</td>
+ <td>58.5</td>
+ <td>54.5</td>
+ <td>-</td>
+ <td>72.7</td>
+ <td>79.0</td>
+ <td>51.1</td>
+ <td>47.0</td>
+ <td>75.7</td>
+ <td>90.7</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
+ <td>8B</td>
+ <td>784</td>
+ <td>63.3</td>
+ <td>741</td>
+ <td>66.2</td>
+ <td>-</td>
+ <td>52.7</td>
+ <td>60.2</td>
+ <td>2328.1</td>
+ <td>76.8</td>
+ <td>79.2</td>
+ <td>52.6</td>
+ <td>44.6</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
+ <td>27B</td>
+ <td>672</td>
+ <td>66.4</td>
+ <td>809</td>
+ <td>63.9</td>
+ <td>86.0</td>
+ <td>60.0</td>
+ <td>61.9</td>
+ <td>2253.0</td>
+ <td>81.2</td>
+ <td>83.8</td>
+ <td>54.0</td>
+ <td>45.3</td>
+ <td><u>84.2</u></td>
+ <td>93.3</td>
+ <td>-</td>
+ <td>-</td>
+ <td>3.0</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
+ <td>8B</td>
+ <td>784</td>
+ <td>67.1</td>
+ <td><u>866</u></td>
+ <td>58.2</td>
+ <td>83.0</td>
+ <td>62.0</td>
+ <td>60.7</td>
+ <td>2326.0</td>
+ <td>81.8</td>
+ <td>83.0</td>
+ <td>54.1</td>
+ <td>50.6</td>
+ <td><strong>84.3</strong></td>
+ <td><u>94.5</u></td>
+ <td>31.9</td>
+ <td>16.3</td>
+ <td>3.2</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
+ <td>72B</td>
+ <td>182</td>
+ <td>68.1</td>
+ <td>741</td>
+ <td>67.5</td>
+ <td>83.7</td>
+ <td>60.6</td>
+ <td><strong>65.8</strong></td>
+ <td>2261.0</td>
+ <td><strong>85.0</strong></td>
+ <td><u>85.6</u></td>
+ <td>56.8</td>
+ <td>49.0</td>
+ <td>80.5</td>
+ <td>91.3</td>
+ <td>39.1</td>
+ <td>-</td>
+ <td>3.5</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
+ <td>8B</td>
+ <td>706</td>
+ <td>68.3</td>
+ <td>822</td>
+ <td><u>64.4</u></td>
+ <td>84.8</td>
+ <td>62.8</td>
+ <td>62.8</td>
+ <td>2344.0</td>
+ <td><u>83.6</u></td>
+ <td>84.5</td>
+ <td>56.0</td>
+ <td>50.1</td>
+ <td>79.1</td>
+ <td>93.0</td>
+ <td>39.5</td>
+ <td>19.7</td>
+ <td>3.4</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
+ <td>8B</td>
+ <td><strong>2822</strong></td>
+ <td>65.2</td>
+ <td>852*</td>
+ <td>60.6</td>
+ <td>79.4</td>
+ <td>60.0</td>
+ <td>57.5</td>
+ <td><u>2348.4*</u></td>
+ <td>78.0</td>
+ <td>82.1</td>
+ <td>49.8*</td>
+ <td>48.1*</td>
+ <td>80.1</td>
+ <td>90.8</td>
+ <td>25.7</td>
+ <td>18.3</td>
+ <td>3.6</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>8B</td>
+ <td><strong>2822</strong></td>
+ <td><strong>70.2</strong></td>
+ <td><strong>897*</strong></td>
+ <td><strong>71.9*</strong></td>
+ <td><u>86.9*</u></td>
+ <td><u>67.5</u></td>
+ <td><u>64.0</u></td>
+ <td><strong>2372.0*</strong></td>
+ <td>80.5</td>
+ <td><strong>85.8</strong></td>
+ <td>50.4*</td>
+ <td><u>51.9</u></td>
+ <td>82.0</td>
+ <td>93.5</td>
+ <td><u>41.4*</u></td>
+ <td><u>23.1*</u></td>
+ <td><strong>3.8</strong></td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+ * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.

+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
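The definition above is a simple ratio, sketched below. The example figures (a 1344x1344 maximum resolution encoded into 640 visual tokens) are illustrative assumptions, not official numbers for any specific model in this table:

```python
# Minimal sketch of the Token Density formula defined above.
# Assumption: the resolution (1344x1344) and visual token count (640)
# used in the example call are illustrative values only.
def token_density(max_width: int, max_height: int, num_visual_tokens: int) -> float:
    """Pixels encoded into each visual token at maximum resolution."""
    return (max_width * max_height) / num_visual_tokens

print(round(token_density(1344, 1344, 640)))  # 2822
```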
 

+ **Multi-image and Video Understanding**

+ <div align="center">
+
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Model</th>
+ <th>Size</th>
+ <th>BLINK val</th>
+ <th>Mantis Eval</th>
+ <th>MIRB</th>
+ <th>Video-MME (wo / w subs)</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
+ <td>-</td>
+ <td><strong>68.0</strong></td>
+ <td>-</td>
+ <td>-</td>
+ <td><strong>71.9/77.2</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT4V</td>
+ <td>-</td>
+ <td>54.6</td>
+ <td>62.7</td>
+ <td>53.1</td>
+ <td>59.9/63.3</td>
+ </tr>
+ <tr>
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
+ <td>8B</td>
+ <td>45.0</td>
+ <td>-</td>
+ <td>-</td>
+ <td>56.1/58.7</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
+ <td>14B</td>
+ <td>52.6</td>
+ <td>66.4</td>
+ <td>30.2</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
+ <td>72B</td>
+ <td>55.4</td>
+ <td><strong>77.6</strong></td>
+ <td>-</td>
+ <td><u>66.2/69.5</u></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MANTIS 8B</td>
+ <td>8B</td>
+ <td>49.1</td>
+ <td>59.5</td>
+ <td>34.8</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
+ <td>8B</td>
+ <td>53.2</td>
+ <td>69.6*</td>
+ <td><strong>67.6*</strong></td>
+ <td>63.3/69.0</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
+ <td>8B</td>
+ <td>54.8</td>
+ <td>67.7</td>
+ <td>52.5</td>
+ <td>64.2/66.9</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
+ <td>8B</td>
+ <td>53.0</td>
+ <td>69.1</td>
+ <td>53.8</td>
+ <td>60.9/63.6</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>8B</td>
+ <td><u>56.7</u></td>
+ <td><u>71.9</u></td>
+ <td><u>58.6</u></td>
+ <td>63.9/67.9</td>
+ </tr>
+ </tbody>
+ </table>

+ </div>
+ * We evaluate officially released checkpoints by ourselves.
 
 
 
 
 

+ </details>

+ <details>
+ <summary>Click to view audio understanding and speech conversation results.</summary>

+ **Audio Understanding**

+ <div align="center">
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Task</th>
+ <th>Size</th>
+ <th colspan="3">ASR (zh)</th>
+ <th colspan="3">ASR (en)</th>
+ <th colspan="2">AST</th>
+ <th>Emotion</th>
+ </tr>
+ <tr>
+ <th align="left">Metric</th>
+ <td></td>
+ <th colspan="3">CER↓</th>
+ <th colspan="3">WER↓</th>
+ <th colspan="2">BLEU↑</th>
+ <th>ACC↑</th>
+ </tr>
+ <tr>
+ <th align="left">Dataset</th>
+ <td></td>
+ <th>AISHELL-1</th>
+ <th>Fleurs zh</th>
+ <th>WenetSpeech test-net</th>
+ <th>LibriSpeech test-clean</th>
+ <th>GigaSpeech</th>
+ <th>TED-LIUM</th>
+ <th>CoVoST en2zh</th>
+ <th>CoVoST zh2en</th>
+ <th>MELD emotion</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
+ <td>-</td>
+ <td>7.3*</td>
+ <td><u>5.4*</u></td>
+ <td>28.9*</td>
+ <td>2.6*</td>
+ <td>12.9*</td>
+ <td>4.8*</td>
+ <td>37.1*</td>
+ <td>15.7*</td>
+ <td>33.2*</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
+ <td>-</td>
+ <td>4.5*</td>
+ <td>5.9*</td>
+ <td>14.3*</td>
+ <td>2.9*</td>
+ <td>10.6*</td>
+ <td><strong>3.0*</strong></td>
+ <td><u>47.3*</u></td>
+ <td>22.6*</td>
+ <td>48.4*</td>
+ </tr>
+ <tr>
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
+ <td>8B</td>
+ <td>-</td>
+ <td>7.5</td>
+ <td>-</td>
+ <td><strong>1.6</strong></td>
+ <td>-</td>
+ <td>-</td>
+ <td>45.2</td>
+ <td><u>24.4</u></td>
+ <td><strong>55.3</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
+ <td>8B</td>
+ <td>2.6*</td>
+ <td>6.9*</td>
+ <td><u>10.3*</u></td>
+ <td>3.1*</td>
+ <td><u>9.7</u>*</td>
+ <td>5.9*</td>
+ <td>39.5*</td>
+ <td>22.9*</td>
+ <td>17.4*</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
+ <td>8B</td>
+ <td>2.16</td>
+ <td>-</td>
+ <td>8.4</td>
+ <td>3.4</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
+ <td>9B</td>
+ <td><u>2.5</u></td>
+ <td>-</td>
+ <td>-</td>
+ <td>2.8</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>8B</td>
+ <td><strong>1.6</strong></td>
+ <td><strong>4.4</strong></td>
+ <td><strong>6.9</strong></td>
+ <td><u>1.7</u></td>
+ <td><strong>8.7</strong></td>
+ <td><strong>3.0</strong></td>
+ <td><strong>48.2</strong></td>
+ <td><strong>27.2</strong></td>
+ <td><u>52.4</u></td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+ * We evaluate officially released checkpoints by ourselves.<br><br>
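For reference, the CER and WER columns above follow the standard Levenshtein-based error-rate definitions; a minimal sketch is shown below (an assumption about the exact evaluation setup: real ASR evaluations also apply text normalization, which is omitted here):

```python
# Standard edit-distance-based error rates, as commonly used for ASR evaluation.
def edit_distance(ref, hyp):
    # Levenshtein distance between two token sequences via dynamic programming.
    d = [[i + j if min(i, j) == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                # deletion
                          d[i][j - 1] + 1,                                # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))   # substitution
    return d[-1][-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(wer("the cat sat", "the cat sat down"), 3))  # 0.333 (one insertion over 3 words)
```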

+ **Speech Generation**
+
+ <div align="center">
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Task</th>
+ <th>Size</th>
+ <th colspan="9">SpeechQA</th>
+ </tr>
+ <tr>
+ <th align="left">Metric</th>
+ <th></th>
+ <th colspan="3">ACC↑</th>
+ <th>G-Eval (10 point)↑</th>
+ <th>Semantic ELO score↑</th>
+ <th>Acoustic ELO score↑</th>
+ <th>Overall ELO score↑</th>
+ <th>UTMOS↑</th>
+ <th>ASR-WER↓</th>
+ </tr>
+ <tr>
+ <th align="left">Dataset</th>
+ <th></th>
+ <th>Speech Llama Q.</th>
+ <th>Speech Web Q.</th>
+ <th>Speech Trivia QA</th>
+ <th>Speech AlpacaEval</th>
+ <th colspan="5">AudioArena</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
+ <td></td>
+ <td><strong>71.7</strong></td>
+ <td><strong>51.6</strong></td>
+ <td><strong>69.7</strong></td>
+ <td><strong>7.4</strong></td>
+ <td><strong>1157</strong></td>
+ <td><strong>1203</strong></td>
+ <td><strong>1200</strong></td>
+ <td><strong>4.2</strong></td>
+ <td><strong>2.3</strong></td>
+ </tr>
+ <tr>
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GLM-4-Voice</td>
+ <td>9B</td>
+ <td>50.0</td>
+ <td>32.0</td>
+ <td>36.4</td>
+ <td><u>5.1</u></td>
+ <td>999</td>
+ <td>1147</td>
+ <td>1035</td>
+ <td><u>4.1</u></td>
+ <td><u>11.7</u></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Llama-Omni</td>
+ <td>8B</td>
+ <td>45.3</td>
+ <td>22.9</td>
+ <td>10.7</td>
+ <td>3.9</td>
+ <td>960</td>
+ <td>878</td>
+ <td>897</td>
+ <td>3.2</td>
+ <td>24.3</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
+ <td>8B</td>
+ <td>46.7</td>
+ <td>28.1</td>
+ <td>23.3</td>
+ <td>2.0</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Moshi</td>
+ <td>7B</td>
+ <td>43.7</td>
+ <td>23.8</td>
+ <td>16.7</td>
+ <td>2.4</td>
+ <td>871</td>
+ <td>808</td>
+ <td>875</td>
+ <td>2.8</td>
+ <td>8.2</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Mini-Omni</td>
+ <td>1B</td>
+ <td>22.0</td>
+ <td>12.8</td>
+ <td>6.9</td>
+ <td>2.5</td>
+ <td>926</td>
+ <td>803</td>
+ <td>865</td>
+ <td>3.4</td>
+ <td>10.0</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>8B</td>
+ <td><u>61.0</u></td>
+ <td><u>40.0</u></td>
+ <td><u>40.2</u></td>
+ <td><u>5.1</u></td>
+ <td><u>1088</u></td>
+ <td><u>1163</u></td>
+ <td><u>1131</u></td>
+ <td><strong>4.2</strong></td>
+ <td>9.8</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+ All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
 
+ **End-to-end Voice Cloning**
+
+ <div align="center">
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Task</th>
+ <th colspan="2">Voice cloning</th>
+ </tr>
+ <tr>
+ <th align="left">Metric</th>
+ <th>SIMO↑</th>
+ <th>SIMO↑</th>
+ </tr>
+ <tr>
+ <th align="left">Dataset</th>
+ <th>Seed-TTS test-zh</th>
+ <th>Seed-TTS test-en</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td nowrap="nowrap" align="left">F5-TTS</td>
+ <td><strong>76</strong></td>
+ <td><strong>67</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">CosyVoice</td>
+ <td><u>75</u></td>
+ <td><u>64</u></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">FireRedTTS</td>
+ <td>63</td>
+ <td>46</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>57</td>
+ <td>47</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+
+ </details>
+
+ <details>
+ <summary>Click to view multimodal live streaming results.</summary>
+
+ **Multimodal Live Streaming**: results on StreamingBench
+
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Model</th>
+ <th>Size</th>
+ <th>Real-Time Video Understanding</th>
+ <th>Omni-Source Understanding</th>
+ <th>Contextual Understanding</th>
+ <th>Overall</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
+ <td>-</td>
+ <td><u>77.4</u></td>
+ <td><strong>67.8</strong></td>
+ <td><strong>51.1</strong></td>
+ <td><strong>70.3</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-202408</td>
+ <td>-</td>
+ <td>74.5</td>
+ <td>51.0</td>
+ <td><u>48.0</u></td>
+ <td>64.1</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
+ <td>-</td>
+ <td>74.0</td>
+ <td>41.4</td>
+ <td>37.8</td>
+ <td>59.7</td>
+ </tr>
+ <tr>
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VILA-1.5</td>
+ <td>8B</td>
+ <td>61.5</td>
+ <td>37.5</td>
+ <td>26.7</td>
+ <td>49.5</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LongVA</td>
+ <td>7B</td>
+ <td>63.1</td>
+ <td>35.9</td>
+ <td>30.2</td>
+ <td>50.7</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
+ <td>34B</td>
+ <td>69.8</td>
+ <td>41.7</td>
+ <td>34.3</td>
+ <td>56.7</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
+ <td>8B</td>
+ <td>71.2</td>
+ <td>40.7</td>
+ <td>33.1</td>
+ <td>57.0</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
+ <td>8B</td>
+ <td>70.1</td>
+ <td>42.7</td>
+ <td>34.1</td>
+ <td>57.0</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
+ <td>8B</td>
+ <td>70.9</td>
+ <td>40.8</td>
+ <td>35.8</td>
+ <td>57.4</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
+ <td>8B</td>
+ <td>74.3</td>
+ <td>40.8</td>
+ <td>31.0</td>
+ <td>58.4</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
+ <td>8B</td>
+ <td>75.4</td>
+ <td>46.2</td>
+ <td>33.6</td>
+ <td>60.8</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
+ <td>8B</td>
+ <td>72.4</td>
+ <td>40.2</td>
+ <td>33.4</td>
+ <td>57.7</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
+ <td>8B</td>
+ <td><strong>79.9</strong></td>
+ <td><u>53.4</u></td>
+ <td>38.5</td>
+ <td><u>66.0</u></td>
+ </tr>
+ </tbody>
+ </table>
+
+ </details>
+
+
+ ### Examples <!-- omit in toc -->
+
+ We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording on an iPad Pro and a web demo.
+
+ <div align="center">
+ <a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
+ </div>
+
+ <br>
+
+ <div style="display: flex; flex-direction: column; align-items: center;">
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
+ </div>
+
+
+ ## Legacy Models <!-- omit in toc -->
+
+ | Model | Introduction and Guidance |
+ |:----------------------|:-------------------:|
+ | MiniCPM-V 4.0 | [Document](./docs/minicpm_v4_en.md) |
+ | MiniCPM-V 2.6 | [Document](./docs/minicpm_v2dot6_en.md) |
+ | MiniCPM-Llama3-V 2.5 | [Document](./docs/minicpm_llama3_v2dot5.md) |
+ | MiniCPM-V 2.0 | [Document](./docs/minicpm_v2.md) |
+ | MiniCPM-V 1.0 | [Document](./docs/minicpm_v1.md) |
+ | OmniLMM-12B | [Document](./docs/omnilmm_en.md) |
+
+
+ ## MiniCPM-V & o Cookbook
+
+ Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
+
+ **Easy Usage Documentation**
+
+ Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
+ All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
+
+ **Broad User Spectrum**
+
+ We support a wide range of users, from individuals to enterprises and researchers.
+
+ * **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
+ * **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
+ * **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
+
+ **Versatile Deployment Scenarios**
+
+ Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
+
+ * **Web demo**: Launch an interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
+ * **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
+ * **End devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
+
+
+ ## Chat with Our Demo on Gradio 🤗
+
+ We provide online and local demos powered by Hugging Face Gradio <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>, one of the most popular model deployment frameworks. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
+
+
+ ### Online Demo <!-- omit in toc -->
+
+ Click here to try out the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn/) | [MiniCPM-V 2.6](http://120.92.209.146:8887/) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).
+
+ ### Local WebUI Demo <!-- omit in toc -->
+
+ You can easily build your own local WebUI demo using the following commands.
+
+ Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues.
+
+ If you are using an older version of PyTorch, you might encounter `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, add `self.minicpmo_model.tts.float()` during model initialization.
+
+ **For real-time voice/video call demo:**
+ 1. Launch the model server:
+ ```shell
+ pip install -r requirements_o2.6.txt
+
+ python web_demos/minicpm-o_2.6/model_server.py
+ ```
+
+ 2. Launch the web server:
+
+ ```shell
+ # Make sure Node and PNPM are installed.
+ sudo apt-get update
+ sudo apt-get install nodejs npm
+ npm install -g pnpm
+
+ cd web_demos/minicpm-o_2.6/web_server
+ # Create an SSL cert for HTTPS; HTTPS is required to request camera and microphone permissions.
+ bash ./make_ssl_cert.sh # output key.pem and cert.pem
+
+ pnpm install # install requirements
+ pnpm run dev # start server
+ ```
+ Open `https://localhost:8088/` in your browser and enjoy the real-time voice/video call.
+
+ **For chatbot demo:**
+ ```shell
+ pip install -r requirements_o2.6.txt
+
+ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
+ ```
+ Open `http://localhost:8000/` in your browser and enjoy the vision mode chatbot.
+
+ ## Inference
+
+
+ ### Model Zoo
+
+ | Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
+ |:-----------|:--:|:-----------:|:-------------------|:---------------:|
+ | MiniCPM-V 4.5| GPU | 18 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5) |
+ | MiniCPM-V 4.5 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf) |
+ | MiniCPM-V 4.5 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4) |
+ | MiniCPM-V 4.5 AWQ | GPU | 9 GB | The AWQ quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-AWQ) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-AWQ) |
+ | MiniCPM-o 2.6| GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
+ | MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
+ | MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
+
+ ### Multi-turn Conversation
+
+ If you wish to enable long-thinking mode, provide the argument `enable_thinking=True` to the chat function.
+
+ ```shell
+ pip install -r requirements_o2.6.txt
+ ```
+
+ Please refer to the following code to run.
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/show_demo.jpg" width="500px">
+ </div>
+
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ torch.manual_seed(100)
+
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
+
+ image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
+
+ enable_thinking = False # If `enable_thinking=True`, the long-thinking mode is enabled.
+
+ # First round chat
+ question = "What is the landform in the picture?"
+ msgs = [{'role': 'user', 'content': [image, question]}]
+
+ answer = model.chat(
+     msgs=msgs,
+     tokenizer=tokenizer,
+     enable_thinking=enable_thinking
+ )
+ print(answer)
+
+ # Second round chat, pass history context of multi-turn conversation
+ msgs.append({"role": "assistant", "content": [answer]})
+ msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
+
+ answer = model.chat(
+     msgs=msgs,
+     tokenizer=tokenizer
+ )
+ print(answer)
+ ```
+
+ You will get the following output:
+
+ ```shell
+ # round1
+ The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.
+
+ This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.
+
+ # round2
+ When traveling to a karst landscape like this, here are some important tips:
+
+ 1. Wear comfortable shoes: The terrain can be uneven and hilly.
+ 2. Bring water and snacks for energy during hikes or boat rides.
+ 3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
+ 4. Respect local customs and nature regulations by not littering or disturbing wildlife.
+
+ By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
+ ```
+
+ #### Chat with Multiple Images
+ <details>
+ <summary> Click to view Python code running MiniCPM-V-4_5 with multiple images input. </summary>
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
+
+ image1 = Image.open('image1.jpg').convert('RGB')
+ image2 = Image.open('image2.jpg').convert('RGB')
+ question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
+
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
+
+ answer = model.chat(
+     msgs=msgs,
+     tokenizer=tokenizer
+ )
+ print(answer)
+ ```
+ </details>

#### In-context Few-shot Learning
<details>
<summary> Click to view Python code running MiniCPM-V-4_5 with few-shot input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### Chat with Video
<details>
<summary> Click to view Python code running MiniCPM-V-4_5 with video input and 3D-Resampler. </summary>

```python
## The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
# To achieve this, you need to organize your video data into two corresponding sequences:
#   frames: List[Image]
#   temporal_ids: List[List[int]]

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES = 180  # the maximum number of frames after the videos are packed; the actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING
MAX_NUM_PACKING = 3   # the maximum packing number of video frames; valid range: 1-6
TIME_SCALE = 0.1

def map_to_nearest_scale(values, scale):
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]


def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps

    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]
    frame_idx = np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)

    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')

    frames = vr.get_batch(frame_idx).asnumpy()

    frame_idx_ts = frame_idx / fps

)
print(answer)
```
</details>
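To make the `temporal_ids` format concrete, here is a small, self-contained sketch of how the `map_to_nearest_scale` and `group_array` helpers above pair sampled frame timestamps with integer time ids and pack them into groups. It is purely illustrative: it re-implements the nearest-scale lookup with NumPy instead of `cKDTree`, and the timestamps and packing number are example values, not the exact construction consumed by `model.chat`.

```python
import numpy as np

TIME_SCALE = 0.1  # same constant as in the script above

def map_to_nearest_scale(values, scale):
    # NumPy-only stand-in for the cKDTree version above:
    # snap each timestamp to the nearest point on the time grid
    values = np.asarray(values, dtype=float)
    scale = np.asarray(scale, dtype=float)
    indices = np.abs(values[:, None] - scale[None, :]).argmin(axis=1)
    return scale[indices]

def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]

# Suppose we sampled 6 frames from a 2-second clip (timestamps in seconds)
frame_idx_ts = np.array([0.0, 0.4, 0.8, 1.2, 1.6, 2.0])
packing_nums = 3  # pack every 3 consecutive frames into one token group

# Snap timestamps onto a 0.1s grid and express them as integer time ids
scale = np.arange(0, frame_idx_ts[-1] + TIME_SCALE, TIME_SCALE)
frame_ts_ids = (map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE).round().astype(int)

# temporal_ids: List[List[int]], one inner list per packed frame group
temporal_ids = [list(map(int, g)) for g in group_array(frame_ts_ids, packing_nums)]
print(temporal_ids)  # [[0, 4, 8], [12, 16, 20]]
```

Each inner list tells the 3D-Resampler which timestamps were packed together, so frame order and timing survive the compression to 64 tokens per group.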


#### Speech and Audio Mode

Model initialization

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

<hr/>

##### Mimick <!-- omit in toc -->

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True)  # load the audio to be mimicked

# You can also try:
#   `./assets/input_examples/fast-pace.wav`,
#   `./assets/input_examples/chi-english-1.wav`,
#   `./assets/input_examples/exciting-emotion.wav`,
# for different aspects of speech-centric features.

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav',  # save the tts result to output_audio_path
)
```

<hr/>

##### General Speech Conversation with Configurable Voices <!-- omit in toc -->

A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model will mimic the character's voice to some extent and act like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# round two
history = msgs + [{'role': 'assistant', 'content': res}]
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = history + [user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```

<hr/>

##### Speech Conversation as an AI Assistant <!-- omit in toc -->

An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, with a limited choice of voices. In this mode, `MiniCPM-o-2.6` sounds **less human-like and more like a voice assistant**, and the model follows instructions more closely. For the demo, we suggest using `assistant_female_voice`, `assistant_male_voice`, or `assistant_default_female_voice`. Other voices may work, but are not as stable as the default ones.

*Please note that `assistant_female_voice` and `assistant_male_voice` are more stable but sound robotic, while `assistant_default_female_voice` is more human-like but less stable; its voice often changes over multiple turns. We suggest trying the stable voices `assistant_female_voice` and `assistant_male_voice`.*

```python
ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True)  # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # load the user's audio question

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# round two
history = msgs + [{'role': 'assistant', 'content': res}]
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = history + [user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)
```

<hr/>

##### Instruction-to-Speech <!-- omit in toc -->

`MiniCPM-o-2.6` can also do Instruction-to-Speech, a.k.a. **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more sample Instruction-to-Speech instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.

```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
```
1632
+
1633
+ <hr/>
1634
+
1635
+ ##### Voice Cloning <!-- omit in toc -->
1636
+
1637
+ `MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. With this mode, model will act like a TTS model.
1638
+
1639
+
1640
+ ```python
1641
+ ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
1642
+ sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
1643
+ text_prompt = f"Please read the text below."
1644
+ user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}
1645
+
1646
+ msgs = [sys_prompt, user_question]
1647
+ res = model.chat(
1648
+ msgs=msgs,
1649
+ tokenizer=tokenizer,
1650
+ sampling=True,
1651
+ max_new_tokens=128,
1652
+ use_tts_template=True,
1653
+ generate_audio=True,
1654
+ temperature=0.3,
1655
+ output_audio_path='result_voice_cloning.wav',
1656
+ )
1657
+
1658
+ ```

<hr/>

##### Addressing Various Audio Understanding Tasks <!-- omit in toc -->

`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。`
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"  # can change to other prompts
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True)  # load the audio to be captioned

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
```


#### Multimodal Live Streaming
<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with chat inference. </summary>

```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray(frame.astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents


model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()

# If you are using an older version of PyTorch, you might encounter the issue
# "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'; in that case,
# please convert the TTS module to float32:
# model.tts.float()

# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
video_path = "assets/Skiing.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# if you use a voice clone prompt, please set ref_audio
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# please set generate_audio=True and output_audio_path to save the tts result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,  # please set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
</details>

<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with streaming inference. </summary>

Note: streaming inference has a slight performance degradation because the audio encoding is not global.
```python
# A new conversation needs reset_session() first; it resets the KV cache.
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt

    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```

</details>

### Inference on Multiple GPUs
You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this [tutorial](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for detailed instructions on how to load the model and run inference on multiple low-VRAM GPUs.
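The layer distribution described above is expressed as a Hugging Face `device_map`. The sketch below shows one possible way to split `N` decoder layers evenly across `K` GPUs; the module names are hypothetical placeholders, and the linked tutorial gives the exact map for each model.

```python
def make_device_map(num_layers, num_gpus):
    """Evenly assign decoder layers to GPUs. Embeddings and vision modules
    go to the first GPU, the norm and output head to the last, so that
    activations flow forward through the pipeline."""
    device_map = {
        'llm.model.embed_tokens': 0,     # NOTE: module names here are
        'vpm': 0,                        # illustrative placeholders; see the
        'resampler': 0,                  # tutorial for the real ones
        'llm.model.norm': num_gpus - 1,
        'llm.lm_head': num_gpus - 1,
    }
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    for i in range(num_layers):
        device_map[f'llm.model.layers.{i}'] = min(i // per_gpu, num_gpus - 1)
    return device_map

device_map = make_device_map(num_layers=32, num_gpus=2)
# Then load with: AutoModel.from_pretrained(..., device_map=device_map)
print(device_map['llm.model.layers.0'], device_map['llm.model.layers.31'])  # 0 1
```

Passing such a map to `from_pretrained` lets Accelerate place each submodule on the requested device, so no single GPU has to hold the whole model.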


### Inference on Mac
<details>
<summary>Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>

```python
# test.py  (requires more than 16 GB of memory)
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)
```
Run with the command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```
</details>


### Efficient Inference with llama.cpp, Ollama, vLLM

See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference at 16-18 tokens/s on iPad (test environment: iPad Pro with M4 chip).

See [our fork of Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference at 16-18 tokens/s on iPad (test environment: iPad Pro with M4 chip).


<details>
<summary> vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0, and you can use our fork to run MiniCPM-o 2.6 for now. Click to see. </summary>

1. Install vLLM (>= 0.7.1):
```shell
pip install vllm
```

2. Run the examples:
* [Vision Language](https://docs.vllm.ai/en/latest/getting_started/examples/vision_language.html)
* [Audio Language](https://docs.vllm.ai/en/latest/getting_started/examples/audio_language.html)
</details>
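As a quick orientation, serving one of the supported checkpoints through vLLM's OpenAI-compatible server can look like the sketch below. The model name and context length are placeholders to adjust for your hardware; see the official examples linked above for multimodal request formats.

```shell
pip install "vllm>=0.7.1"

# Start an OpenAI-compatible server (model and max length are example values)
vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code --max-model-len 4096

# In another terminal, send a chat request to the server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openbmb/MiniCPM-V-2_6", "messages": [{"role": "user", "content": "Hello!"}]}'
```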

## Fine-tuning

### Simple Fine-tuning <!-- omit in toc -->

We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.

[Reference Document](./finetune/readme.md)


### With Align-Anything <!-- omit in toc -->

We support fine-tuning MiniCPM-o 2.6 with the [Align-Anything framework](https://github.com/PKU-Alignment/align-anything) by the PKU-Alignment Team (both vision and audio, SFT and DPO). Align-Anything is a scalable framework that aims to align any-modality large models with human intentions, open-sourcing the [datasets, models and benchmarks](https://huggingface.co/datasets/PKU-Alignment/align-anything). Benefiting from its concise and modular design, it supports 30+ open-source benchmarks, 40+ models, and algorithms including SFT, SimPO, RLHF, *etc*. It also provides 30+ directly runnable scripts, making it suitable for beginners to get started quickly.

Best Practices: [MiniCPM-o 2.6](https://github.com/PKU-Alignment/align-anything/tree/main/scripts).


### With LLaMA-Factory <!-- omit in toc -->

We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (LoRA/Full/QLoRA) of 200+ LLMs without coding, through the built-in web UI LLaMA Board. It supports various training methods such as SFT/PPO/DPO/KTO and advanced algorithms such as GaLore/BAdam/LLaMA-Pro/PiSSA/LongLoRA.

Best Practices: [MiniCPM-o 2.6 | MiniCPM-V 2.6](./docs/llamafactory_train_and_infer.md).


### With the SWIFT Framework <!-- omit in toc -->

We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapter library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).

## Awesome work using MiniCPM-V & MiniCPM-o
- [text-extract-api](https://github.com/CatchTheTornado/text-extract-api): Document extraction API using OCR and Ollama-supported models ![GitHub Repo stars](https://img.shields.io/github/stars/CatchTheTornado/text-extract-api)
- [comfyui_LLM_party](https://github.com/heshengtao/comfyui_LLM_party): Build LLM workflows and integrate them into existing image workflows ![GitHub Repo stars](https://img.shields.io/github/stars/heshengtao/comfyui_LLM_party)
- [Ollama-OCR](https://github.com/imanoop7/Ollama-OCR): OCR package that uses VLMs through Ollama to extract text from images and PDFs ![GitHub Repo stars](https://img.shields.io/github/stars/imanoop7/Ollama-OCR)
- [comfyui-mixlab-nodes](https://github.com/MixLabPro/comfyui-mixlab-nodes): ComfyUI node suite supporting Workflow-to-APP, GPT&3D and more ![GitHub Repo stars](https://img.shields.io/github/stars/MixLabPro/comfyui-mixlab-nodes)
- [OpenAvatarChat](https://github.com/HumanAIGC-Engineering/OpenAvatarChat): Interactive digital human conversation implementation on a single PC ![GitHub Repo stars](https://img.shields.io/github/stars/HumanAIGC-Engineering/OpenAvatarChat)
- [pensieve](https://github.com/arkohut/pensieve): A privacy-focused passive recording project that records screen content ![GitHub Repo stars](https://img.shields.io/github/stars/arkohut/pensieve)
- [paperless-gpt](https://github.com/icereed/paperless-gpt): Use LLMs to handle paperless-ngx with AI-powered titles, tags and OCR ![GitHub Repo stars](https://img.shields.io/github/stars/icereed/paperless-gpt)
- [Neuro](https://github.com/kimjammer/Neuro): A recreation of Neuro-Sama, running on local models on consumer hardware ![GitHub Repo stars](https://img.shields.io/github/stars/kimjammer/Neuro)

## FAQs
Click here to view the [FAQs](./docs/faqs.md)

## Limitations
As an experimental trial, we find MiniCPM-o 2.6 has notable limitations worth further investigation and improvement.
- **Unstable speech output.** Speech generation can be flawed, with noisy backgrounds and meaningless sounds.
- **Repeated responses.** The model tends to repeat its response when encountering similar consecutive user queries.
- **High latency on the web demo.** Users may experience unusually high latency when using the web demo hosted on overseas servers. We recommend deploying the demo locally or using it with a good network connection.

## Model License <!-- omit in toc -->

* The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.

* To help us better understand and support our users, we would deeply appreciate it if you could consider optionally filling out a brief registration ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).

## Statement <!-- omit in toc -->

As MLLMs, MiniCPM-o/V models generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o/V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the models.

## Institutions <!-- omit in toc -->

This project is developed by the following institutions:

- <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/thunlp.png" width="28px"> [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
- <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelbest.png" width="28px"> [ModelBest](https://modelbest.cn/)

## 🌟 Star History <!-- omit in toc -->

<p align="center">
  <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/star-history-25-09-02.png"/>
</p>

<!-- <picture>
  <source
    media="(prefers-color-scheme: dark)"
    srcset="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date&theme=dark"
  />
  <source
    media="(prefers-color-scheme: light)"
    srcset="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date"
  />
  <img
    alt="Star History Chart"
    src="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date"
  />
</picture> -->

## Key Techniques and Other Multimodal Projects <!-- omit in toc -->

👏 Welcome to explore the key techniques of MiniCPM-o/V and other multimodal projects of our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)


## Citation <!-- omit in toc -->

If you find our model/code/paper helpful, please consider citing our papers 📝 and giving us a star ⭐️!
```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```