Improve model card title, pipeline tag, and GitHub link

#3
by nielsr HF Staff - opened

Files changed (1): README.md (+1811 -306)
@@ -1,10 +1,11 @@
  ---
- pipeline_tag: image-text-to-text
  datasets:
  - openbmb/RLAIF-V-Dataset
- library_name: transformers
  language:
  - multilingual
  tags:
  - minicpm-v
  - vision
@@ -12,12 +13,13 @@ tags:
  - multi-image
  - video
  - custom_code
- license: apache-2.0
  ---

- <h1>A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone</h1>

- [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Technical Report](https://huggingface.co/papers/2509.18154) | [Demo](http://101.126.42.235:30910/)

@@ -50,7 +52,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github

  - **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
- - **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
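
The corruption-and-reconstruction idea in the pre-training bullet can be sketched with a toy helper (illustrative only, not the model's actual data pipeline; the function name, box coordinates, and noise levels are assumptions):

```python
import numpy as np

def corrupt_text_region(image, box, noise_sigma, rng=None):
    """Add Gaussian noise of strength noise_sigma to a text region.

    image: HxWx3 uint8 array; box: (x0, y0, x1, y1).
    A low sigma leaves the text legible (OCR-style supervision);
    a high sigma obscures it, forcing context-based reconstruction.
    """
    rng = rng or np.random.default_rng(0)
    out = image.astype(np.float32)  # astype copies, so the input is untouched
    x0, y0, x1, y1 = box
    out[y0:y1, x0:x1] += rng.normal(0.0, noise_sigma, size=out[y0:y1, x0:x1].shape)
    return np.clip(out, 0, 255).astype(np.uint8)

# Vary the noise level per sample so the model sees both regimes.
page = np.full((64, 64, 3), 255, dtype=np.uint8)  # stand-in for a document image
lightly_noised = corrupt_text_region(page, (8, 8, 56, 24), noise_sigma=10.0)
heavily_noised = corrupt_text_region(page, (8, 8, 56, 24), noise_sigma=120.0)
```

The training target in both regimes is the original text, so the same objective covers text recognition and knowledge-based reconstruction.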
  ### Evaluation

@@ -150,362 +152,1865 @@ Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference.
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
  </div>

- We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without editing.
-
- <div align="center">
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_cot.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
  </div>

- <div align="center">
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
- </div>
-
- ## Framework Support Matrix
- <table>
- <thead>
- <tr>
- <th>Category</th>
- <th>Framework</th>
- <th>Cookbook Link</th>
- <th>Upstream PR</th>
- <th>Supported since (branch)</th>
- <th>Supported since (release)</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td rowspan="2">Edge (On-device)</td>
- <td>Llama.cpp</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_5_llamacpp.md">Llama.cpp Doc</a></td>
- <td><a href="https://github.com/ggml-org/llama.cpp/pull/15575">#15575</a> (2025-08-26)</td>
- <td>master (2025-08-26)</td>
- <td><a href="https://github.com/ggml-org/llama.cpp/releases/tag/b6282">b6282</a></td>
- </tr>
- <tr>
- <td>Ollama</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_5_ollama.md">Ollama Doc</a></td>
- <td><a href="https://github.com/ollama/ollama/pull/12078">#12078</a> (2025-08-26)</td>
- <td>Merging</td>
- <td>Waiting for official release</td>
- </tr>
- <tr>
- <td rowspan="2">Serving (Cloud)</td>
- <td>vLLM</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_5_vllm.md">vLLM Doc</a></td>
- <td><a href="https://github.com/vllm-project/vllm/pull/23586">#23586</a> (2025-08-26)</td>
- <td>main (2025-08-27)</td>
- <td><a href="https://github.com/vllm-project/vllm/releases/tag/v0.10.2">v0.10.2</a></td>
- </tr>
- <tr>
- <td>SGLang</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_5_sglang.md">SGLang Doc</a></td>
- <td><a href="https://github.com/sgl-project/sglang/pull/9610">#9610</a> (2025-08-26)</td>
- <td>Merging</td>
- <td>Waiting for official release</td>
- </tr>
- <tr>
- <td>Finetuning</td>
- <td>LLaMA-Factory</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md">LLaMA-Factory Doc</a></td>
- <td><a href="https://github.com/hiyouga/LLaMA-Factory/pull/9022">#9022</a> (2025-08-26)</td>
- <td>main (2025-08-26)</td>
- <td>Waiting for official release</td>
- </tr>
- <tr>
- <td rowspan="3">Quantization</td>
- <td>GGUF</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_5_gguf_quantize.md">GGUF Doc</a></td>
- <td>—</td>
- <td>—</td>
- <td>—</td>
- </tr>
- <tr>
- <td>BNB</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_5_bnb_quantize.md">BNB Doc</a></td>
- <td>—</td>
- <td>—</td>
- <td>—</td>
- </tr>
- <tr>
- <td>AWQ</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-v4_5_awq_quantize.md">AWQ Doc</a></td>
- <td>—</td>
- <td>—</td>
- <td>—</td>
- </tr>
- <tr>
- <td>Demos</td>
- <td>Gradio Demo</td>
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/gradio/README.md">Gradio Demo Doc</a></td>
- <td>—</td>
- <td>—</td>
- <td>—</td>
- </tr>
- </tbody>
- </table>
-
- > Note: If you'd like us to prioritize support for another open-source framework, please let us know via this [short form](https://docs.google.com/forms/d/e/1FAIpQLSdyTUrOPBgWqPexs3ORrg47ZcZ1r4vFQaA4ve2iA7L9sMfMWw/viewform).
-
- ## Usage
-
- If you wish to enable thinking mode, provide the argument `enable_thinking=True` to the chat function.
-
- #### Chat with Image
- ```python
- import torch
- from PIL import Image
- from transformers import AutoModel, AutoTokenizer
-
- torch.manual_seed(100)
-
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
-     attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
- model = model.eval().cuda()
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
-
- image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
-
- enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
- stream=True # If `stream=True`, the answer is returned as a generator of text chunks.
-
- # First round chat
- question = "What is the landform in the picture?"
- msgs = [{'role': 'user', 'content': [image, question]}]
-
- answer = model.chat(
-     msgs=msgs,
-     tokenizer=tokenizer,
-     enable_thinking=enable_thinking,
-     stream=True
- )
-
- generated_text = ""
- for new_text in answer:
-     generated_text += new_text
-     print(new_text, flush=True, end='')
-
- # Second round chat, pass history context of multi-turn conversation
- msgs.append({"role": "assistant", "content": [generated_text]})
- msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
-
- answer = model.chat(
-     msgs=msgs,
-     tokenizer=tokenizer,
-     stream=True
- )
-
- generated_text = ""
- for new_text in answer:
-     generated_text += new_text
-     print(new_text, flush=True, end='')
- ```
-
- You will get the following output:
-
- ```shell
- # round1
- The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.
-
- This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.
-
- # round2
- When traveling to a karst landscape like this, here are some important tips:
-
- 1. Wear comfortable shoes: The terrain can be uneven and hilly.
- 2. Bring water and snacks for energy during hikes or boat rides.
- 3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
- 4. Respect local customs and nature regulations by not littering or disturbing wildlife.
-
- By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
- ```

- #### Chat with Video
-
- ```python
- ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
- # To achieve this, you need to organize your video data into two corresponding sequences:
- # frames: List[Image]
- # temporal_ids: List[List[Int]].
-
- import torch
- from PIL import Image
- from transformers import AutoModel, AutoTokenizer
- from decord import VideoReader, cpu # pip install decord
- from scipy.spatial import cKDTree
- import numpy as np
- import math
-
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
-     attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
- model = model.eval().cuda()
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
-
- MAX_NUM_FRAMES = 180 # Maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
- MAX_NUM_PACKING = 3  # Maximum packing number of video frames. Valid range: 1-6.
- TIME_SCALE = 0.1
-
- def map_to_nearest_scale(values, scale):
-     tree = cKDTree(np.asarray(scale)[:, None])
-     _, indices = tree.query(np.asarray(values)[:, None])
-     return np.asarray(scale)[indices]
-
- def group_array(arr, size):
-     return [arr[i:i+size] for i in range(0, len(arr), size)]
-
- def encode_video(video_path, choose_fps=3, force_packing=None):
-     def uniform_sample(l, n):
-         gap = len(l) / n
-         idxs = [int(i * gap + gap / 2) for i in range(n)]
-         return [l[i] for i in idxs]
-     vr = VideoReader(video_path, ctx=cpu(0))
-     fps = vr.get_avg_fps()
-     video_duration = len(vr) / fps
-
-     if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
-         packing_nums = 1
-         choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
-     else:
-         packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
-         if packing_nums <= MAX_NUM_PACKING:
-             choose_frames = round(video_duration * choose_fps)
-         else:
-             choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
-             packing_nums = MAX_NUM_PACKING
-
-     frame_idx = [i for i in range(0, len(vr))]
-     frame_idx = np.array(uniform_sample(frame_idx, choose_frames))
-
-     if force_packing:
-         packing_nums = min(force_packing, MAX_NUM_PACKING)
-
-     print(video_path, ' duration:', video_duration)
-     print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')
-
-     frames = vr.get_batch(frame_idx).asnumpy()
-
-     frame_idx_ts = frame_idx / fps
-     scale = np.arange(0, video_duration, TIME_SCALE)
-
-     frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
-     frame_ts_id = frame_ts_id.astype(np.int32)
-
-     assert len(frames) == len(frame_ts_id)
-
-     frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
-     frame_ts_id_group = group_array(frame_ts_id, packing_nums)
-
-     return frames, frame_ts_id_group
-
- video_path = "video_test.mp4"
- fps = 5 # Sampling fps for the video.
- force_packing = None # Set force_packing to force 3D packing on; otherwise encode_video sets the packing number dynamically based on the duration.
- frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)
-
- question = "Describe the video"
- msgs = [
-     {'role': 'user', 'content': frames + [question]},
- ]
-
- answer = model.chat(
-     msgs=msgs,
-     tokenizer=tokenizer,
-     use_image_id=False,
-     max_slice_nums=1,
-     temporal_ids=frame_ts_id_group
- )
- print(answer)
- ```
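
For intuition, the timestamp-to-temporal-id step above can be reproduced in isolation. This is a simplified stand-in for `map_to_nearest_scale` that snaps timestamps to the 0.1 s grid by rounding; the example timestamps are made up:

```python
import numpy as np

TIME_SCALE = 0.1  # grid resolution in seconds, as in the snippet above

# Hypothetical frame timestamps (seconds), sampled at roughly 3 fps.
frame_ts = np.array([0.0, 0.33, 0.66, 1.0, 1.33, 1.66])

# Snap each timestamp to the nearest 0.1 s grid point and express it
# as an integer grid index -- this is what becomes a temporal id.
ids = np.round(frame_ts / TIME_SCALE).astype(np.int32)

# Pack consecutive ids in groups of packing_nums=3; each group of frames
# is compressed into one 64-token unit by the 3D-resampler.
groups = [ids[i:i + 3].tolist() for i in range(0, len(ids), 3)]
print(groups)  # [[0, 3, 7], [10, 13, 17]]
```

The temporal ids preserve each frame's position in time even after several frames are packed into a shared token budget.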

- #### Chat with multiple images
- <details>
- <summary> Click to show Python code running MiniCPM-V 4.5 with multiple images input. </summary>
-
- ```python
- import torch
- from PIL import Image
- from transformers import AutoModel, AutoTokenizer
-
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
-     attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
- model = model.eval().cuda()
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
-
- image1 = Image.open('image1.jpg').convert('RGB')
- image2 = Image.open('image2.jpg').convert('RGB')
- question = 'Compare image 1 and image 2, and tell me the differences between them.'
-
- msgs = [{'role': 'user', 'content': [image1, image2, question]}]
-
- answer = model.chat(
-     msgs=msgs,
-     tokenizer=tokenizer
- )
- print(answer)
- ```
  </details>

- #### In-context few-shot learning
  <details>
- <summary> Click to view Python code running MiniCPM-V 4.5 with few-shot input. </summary>
-
- ```python
- import torch
- from PIL import Image
- from transformers import AutoModel, AutoTokenizer
-
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
-     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
- model = model.eval().cuda()
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
-
- question = "production date"
- image1 = Image.open('example1.jpg').convert('RGB')
- answer1 = "2023.08.04"
- image2 = Image.open('example2.jpg').convert('RGB')
- answer2 = "2007.04.24"
- image_test = Image.open('test.jpg').convert('RGB')
-
  msgs = [
-     {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
-     {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
-     {'role': 'user', 'content': [image_test, question]}
  ]

  answer = model.chat(
      msgs=msgs,
      tokenizer=tokenizer
  )
  print(answer)
  ```
  </details>

- ## License
- #### Model License
- * The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
- * To help us better understand and support our users, we would deeply appreciate it if you could consider optionally filling out a brief registration ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).
-
- #### Statement
- * As an LMM, MiniCPM-V 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend or express personal opinions or make value judgements. Anything generated by MiniCPM-V 4.5 does not represent the views and positions of the model developers.
- * We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
-
- ## Key Techniques and Other Multimodal Projects
-
- 👏 Welcome to explore the key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:
-
- [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
-
- ## Citation
-
- If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
-
  ```bib
  @misc{yu2025minicpmv45cookingefficient,
 
  ---
  datasets:
  - openbmb/RLAIF-V-Dataset
  language:
  - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
  tags:
  - minicpm-v
  - vision
  - multi-image
  - video
  - custom_code
  ---

+ # MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

+ A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone.
+
+ [GitHub](https://github.com/OpenBMB/MiniCPM-V) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Technical Report](https://huggingface.co/papers/2509.18154) | [Demo](http://101.126.42.235:30910/)

  - **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
+ - **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for more complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.

  ### Evaluation

  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
  </div>

+ <details>
+ <summary>Click to view more cases.</summary>
+ <div style="display: flex; flex-direction: column; align-items: center;">
+ <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/zh_extra.jpeg" alt="zh_extra" style="margin-bottom: 5px;">
  </div>

+ </details>

+ We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without editing.

+ <table align="center">
+ <p align="center">
+ <img src="assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
+ &nbsp;&nbsp;&nbsp;&nbsp;
+ <img src="assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
+ </p>
+ <p align="center">
+ <img src="assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
+ &nbsp;&nbsp;&nbsp;&nbsp;
+ <img src="assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
+ </p>
+ </table>

+ ## MiniCPM-o 2.6

+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

+ - 🔥 **Leading Visual Capability.**
+ MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.

+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

+ - 💪 **Strong OCR Capability and Others.**
+ Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.

+ - 🚀 **Superior Efficiency.**
+ In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.

+ - 💫 **Easy Usage.**
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
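
The token-density claim above is easy to check with back-of-the-envelope arithmetic, using the 1344x1344 example resolution and the 640-token figure from this section:

```python
# Pixels in the largest example image vs. visual tokens produced.
pixels = 1344 * 1344       # about 1.8M pixels
tokens = 640               # visual tokens MiniCPM-o 2.6 emits for that image
density = pixels / tokens  # pixels encoded per visual token
print(round(density, 1))   # 2822.4
```

By comparison, a model that spends four times as many tokens on the same image would have a quarter of this density, with correspondingly higher latency and memory use.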

+ **Model Architecture.**

+ - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained in a fully end-to-end manner with only CE loss.
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
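
The time-division multiplexing idea can be illustrated with a toy scheduler (conceptual only; the function name, slice length, and stream contents are assumptions, not the model's implementation):

```python
def tdm_interleave(streams, slice_len):
    """Serialize parallel modality streams into one sequence by
    emitting a small time slice from each stream in turn."""
    out = []
    t = 0
    max_len = max(len(items) for items in streams.values())
    while t < max_len:
        for name, items in streams.items():
            out.extend((name, x) for x in items[t:t + slice_len])
        t += slice_len
    return out

# Two parallel streams become one sequential stream of short slices,
# so a single LLM backbone can consume both without falling behind either.
mixed = tdm_interleave(
    {'video': ['v0', 'v1', 'v2', 'v3'], 'audio': ['a0', 'a1', 'a2', 'a3']},
    slice_len=2,
)
print(mixed[:4])  # [('video', 'v0'), ('video', 'v1'), ('audio', 'a0'), ('audio', 'a1')]
```

Keeping the slices short bounds how stale any one modality's latest input can be when the backbone processes the merged sequence.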

+ <div align="center">
+ <img src="./assets/minicpm-o-26-framework-v2.png" width=80%>
+ </div>

+ ### Evaluation <!-- omit in toc -->

+ <div align="center">
+ <img src="./assets/radar.jpg" width=80%>
+ </div>

+ <details>
+ <summary>Click to view visual understanding results.</summary>

+ **Image Understanding**

+ <div align="center">
+ <table style="margin: 0px auto;">
+ <thead>
+ <tr>
+ <th align="left">Model</th>
+ <th>Size</th>
+ <th>Token Density<sup>+</sup></th>
+ <th>OpenCompass</th>
+ <th>OCRBench</th>
+ <th>MathVista mini</th>
+ <th>ChartQA</th>
+ <th>MMVet</th>
+ <th>MMStar</th>
+ <th>MME</th>
+ <th>MMB1.1 test</th>
+ <th>AI2D</th>
+ <th>MMMU val</th>
+ <th>HallusionBench</th>
+ <th>TextVQA val</th>
+ <th>DocVQA test</th>
+ <th>MathVerse mini</th>
+ <th>MathVision</th>
+ <th>MMHal Score</th>
+ </tr>
+ </thead>
+ <tbody align="center">
+ <tr>
+ <td colspan="19" align="left"><strong>Proprietary</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
+ <td>-</td>
+ <td>1088</td>
+ <td><u>69.9</u></td>
+ <td>736</td>
+ <td>61.3</td>
+ <td>85.7</td>
+ <td><strong>69.1</strong></td>
+ <td>63.9</td>
+ <td>2328.7</td>
+ <td>82.2</td>
+ <td>84.6</td>
+ <td><strong>69.2</strong></td>
+ <td><strong>55.0</strong></td>
+ <td>-</td>
+ <td>92.8</td>
+ <td><strong>50.2</strong></td>
+ <td><strong>30.4</strong></td>
+ <td><u>3.6</u></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
+ <td>-</td>
+ <td>750</td>
+ <td>67.9</td>
+ <td>788</td>
+ <td>61.6</td>
+ <td><strong>90.8</strong></td>
+ <td>66.0</td>
+ <td>62.2</td>
+ <td>1920.0</td>
+ <td>78.5</td>
+ <td>80.2</td>
+ <td><u>65.9</u></td>
+ <td>49.9</td>
+ <td>-</td>
+ <td><strong>95.2</strong></td>
+ <td>-</td>
+ <td>-</td>
+ <td>3.4</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
+ <td>-</td>
+ <td>-</td>
+ <td>64.4</td>
+ <td>754</td>
+ <td>57.7</td>
+ <td>81.3</td>
+ <td>64.0</td>
+ <td>59.1</td>
+ <td>2110.6</td>
+ <td>73.9</td>
+ <td>79.1</td>
+ <td>60.6</td>
+ <td>45.6</td>
+ <td>73.5</td>
+ <td>86.5</td>
+ <td>-</td>
+ <td>19.2</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
+ <td>-</td>
+ <td>1088</td>
+ <td>64.1</td>
+ <td>785</td>
+ <td>52.4</td>
+ <td>-</td>
+ <td>66.9</td>
+ <td>54.8</td>
+ <td>2003.4</td>
+ <td>76.0</td>
+ <td>77.8</td>
+ <td>60.0</td>
+ <td>46.1</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>3.3</td>
+ </tr>
+ <tr>
+ <td colspan="19" align="left"><strong>Open Source</strong></td>
+ </tr>
+ <tr>
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
+ <td>34B</td>
+ <td><u>1820</u></td>
+ <td>58.3</td>
+ <td>591</td>
+ <td>50.3</td>
+ <td>75.6</td>
346
+ <td>53.2</td>
347
+ <td>54.2</td>
348
+ <td>2049.9</td>
349
+ <td>77.8</td>
350
+ <td>79.5</td>
351
+ <td>50.4</td>
352
+ <td>41.6</td>
353
+ <td>76.7</td>
354
+ <td>75.5</td>
355
+ <td>-</td>
356
+ <td>-</td>
357
+ <td>-</td>
358
+ </tr>
359
+ <tr>
360
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
361
+ <td>13B</td>
362
+ <td>784</td>
363
+ <td>59.1</td>
364
+ <td>776</td>
365
+ <td>51.1</td>
366
+ <td>-</td>
367
+ <td>58.0</td>
368
+ <td>54.8</td>
369
+ <td>2018.8</td>
370
+ <td>67.9</td>
371
+ <td>71.2</td>
372
+ <td>46.9</td>
373
+ <td>45.0</td>
374
+ <td>-</td>
375
+ <td>-</td>
376
+ <td>-</td>
377
+ <td>-</td>
378
+ <td>-</td>
379
+ </tr>
380
+ <tr>
381
+ <td nowrap="nowrap" align="left">Pixtral-12B</td>
382
+ <td>12B</td>
383
+ <td>256</td>
384
+ <td>61.0</td>
385
+ <td>685</td>
386
+ <td>56.9</td>
387
+ <td>81.8</td>
388
+ <td>58.5</td>
389
+ <td>54.5</td>
390
+ <td>-</td>
391
+ <td>72.7</td>
392
+ <td>79.0</td>
393
+ <td>51.1</td>
394
+ <td>47.0</td>
395
+ <td>75.7</td>
396
+ <td>90.7</td>
397
+ <td>-</td>
398
+ <td>-</td>
399
+ <td>-</td>
400
+ </tr>
401
+ <tr>
402
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
403
+ <td>8B</td>
404
+ <td>784</td>
405
+ <td>63.3</td>
406
+ <td>741</td>
407
+ <td>66.2</td>
408
+ <td>-</td>
409
+ <td>52.7</td>
410
+ <td>60.2</td>
411
+ <td>2328.1</td>
412
+ <td>76.8</td>
413
+ <td>79.2</td>
414
+ <td>52.6</td>
415
+ <td>44.6</td>
416
+ <td>-</td>
417
+ <td>-</td>
418
+ <td>-</td>
419
+ <td>-</td>
420
+ <td>-</td>
421
+ </tr>
422
+ <tr>
423
+ <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
424
+ <td>27B</td>
425
+ <td>672</td>
426
+ <td>66.4</td>
427
+ <td>809</td>
428
+ <td>63.9</td>
429
+ <td>86.0</td>
430
+ <td>60.0</td>
431
+ <td>61.9</td>
432
+ <td>2253.0</td>
433
+ <td>81.2</td>
434
+ <td>83.8</td>
435
+ <td>54.0</td>
436
+ <td>45.3</td>
437
+ <td><u>84.2</u></td>
438
+ <td>93.3</td>
439
+ <td>-</td>
440
+ <td>-</td>
441
+ <td>3.0</td>
442
+ </tr>
443
+ <tr>
444
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
445
+ <td>8B</td>
446
+ <td>784</td>
447
+ <td>67.1</td>
448
+ <td><u>866</u></td>
449
+ <td>58.2</td>
450
+ <td>83.0</td>
451
+ <td>62.0</td>
452
+ <td>60.7</td>
453
+ <td>2326.0</td>
454
+ <td>81.8</td>
455
+ <td>83.0</td>
456
+ <td>54.1</td>
457
+ <td>50.6</td>
458
+ <td><strong>84.3</strong></td>
459
+ <td><u>94.5</u></td>
460
+ <td>31.9</td>
461
+ <td>16.3</td>
462
+ <td>3.2</td>
463
+ </tr>
464
+ <tr>
465
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
466
+ <td>72B</td>
467
+ <td>182</td>
468
+ <td>68.1</td>
469
+ <td>741</td>
470
+ <td>67.5</td>
471
+ <td>83.7</td>
472
+ <td>60.6</td>
473
+ <td><strong>65.8</strong></td>
474
+ <td>2261.0</td>
475
+ <td><strong>85.0</strong></td>
476
+ <td><u>85.6</u></td>
477
+ <td>56.8</td>
478
+ <td>49.0</td>
479
+ <td>80.5</td>
480
+ <td>91.3</td>
481
+ <td>39.1</td>
482
+ <td>-</td>
483
+ <td>3.5</td>
484
+ </tr>
485
+ <tr>
486
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
487
+ <td>8B</td>
488
+ <td>706</td>
489
+ <td>68.3</td>
490
+ <td>822</td>
491
+ <td><u>64.4</u></td>
492
+ <td>84.8</td>
493
+ <td>62.8</td>
494
+ <td>62.8</td>
495
+ <td>2344.0</td>
496
+ <td><u>83.6</u></td>
497
+ <td>84.5</td>
498
+ <td>56.0</td>
499
+ <td>50.1</td>
500
+ <td>79.1</td>
501
+ <td>93.0</td>
502
+ <td>39.5</td>
503
+ <td>19.7</td>
504
+ <td>3.4</td>
505
+ </tr>
506
+ <tr>
507
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
508
+ <td>8B</td>
509
+ <td><strong>2822</strong></td>
510
+ <td>65.2</td>
511
+ <td>852*</td>
512
+ <td>60.6</td>
513
+ <td>79.4</td>
514
+ <td>60.0</td>
515
+ <td>57.5</td>
516
+ <td><u>2348.4*</u></td>
517
+ <td>78.0</td>
518
+ <td>82.1</td>
519
+ <td>49.8*</td>
520
+ <td>48.1*</td>
521
+ <td>80.1</td>
522
+ <td>90.8</td>
523
+ <td>25.7</td>
524
+ <td>18.3</td>
525
+ <td>3.6</td>
526
+ </tr>
527
+ <tr>
528
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
529
+ <td>8B</td>
530
+ <td><strong>2822</strong></td>
531
+ <td><strong>70.2</strong></td>
532
+ <td><strong>897*</strong></td>
533
+ <td><strong>71.9*</strong></td>
534
+ <td><u>86.9*</u></td>
535
+ <td><u>67.5</u></td>
536
+ <td><u>64.0</u></td>
537
+ <td><strong>2372.0*</strong></td>
538
+ <td>80.5</td>
539
+ <td><strong>85.8</strong></td>
540
+ <td>50.4*</td>
541
+ <td><u>51.9</u></td>
542
+ <td>82.0</td>
543
+ <td>93.5</td>
544
+ <td><u>41.4*</u></td>
545
+ <td><u>23.1*</u></td>
546
+ <td><strong>3.8</strong></td>
547
+ </tr>
548
+ </tbody>
549
+ </table>
550
+ </div>
551
+ * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we use this technique only for the Cognition set.
552
 
 
 
 
 
553
 
554
+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
555
 
556
+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
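As a concrete example, token density follows directly from the definition above. The 1344x1344 maximum slice resolution and 640-token budget below are illustrative assumptions chosen to reproduce the reported value, not official specifications.

```python
def token_density(max_pixels: int, num_visual_tokens: int) -> int:
    # Pixels encoded into each visual token at maximum resolution.
    return round(max_pixels / num_visual_tokens)

# e.g., a 1344x1344 maximum-resolution slice encoded into 640 visual tokens
# (illustrative numbers) yields a token density of about 2822.
density = token_density(1344 * 1344, 640)
```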
 
 
 
 
 
 
 
 
557
 
 
 
 
 
 
 
 
 
558
 
559
+ **Multi-image and Video Understanding**
 
 
 
560
 
561
+ <div align="center">
562
+
563
+ <table style="margin: 0px auto;">
564
+ <thead>
565
+ <tr>
566
+ <th align="left">Model</th>
567
+ <th>Size</th>
568
+ <th>BLINK val</th>
569
+ <th>Mantis Eval</th>
570
+ <th>MIRB</th>
571
+ <th>Video-MME (wo / w subs)</th>
572
+ </tr>
573
+ </thead>
574
+ <tbody align="center">
575
+ <tr>
576
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
577
+ </tr>
578
+ <tr>
579
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
580
+ <td>-</td>
581
+ <td><strong>68.0</strong></td>
582
+ <td>-</td>
583
+ <td>-</td>
584
+ <td><strong>71.9/77.2</strong></td>
585
+ </tr>
586
+ <tr>
587
+ <td nowrap="nowrap" align="left">GPT4V</td>
588
+ <td>-</td>
589
+ <td>54.6</td>
590
+ <td>62.7</td>
591
+ <td>53.1</td>
592
+ <td>59.9/63.3</td>
593
+ </tr>
594
+ <tr>
595
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
596
+ </tr>
597
+ <tr>
598
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
599
+ <td>8B</td>
600
+ <td>45.0</td>
601
+ <td>-</td>
602
+ <td>-</td>
603
+ <td>56.1/58.7</td>
604
+ </tr>
605
+ <tr>
606
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
607
+ <td>14B</td>
608
+ <td>52.6</td>
609
+ <td>66.4</td>
610
+ <td>30.2</td>
611
+ <td>-</td>
612
+ </tr>
613
+ <tr>
614
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
615
+ <td>72B</td>
616
+ <td>55.4</td>
617
+ <td><strong>77.6</strong></td>
618
+ <td>-</td>
619
+ <td><u>66.2/69.5</u></td>
620
+ </tr>
621
+ <tr>
622
+ <td nowrap="nowrap" align="left">MANTIS 8B</td>
623
+ <td>8B</td>
624
+ <td>49.1</td>
625
+ <td>59.5</td>
626
+ <td>34.8</td>
627
+ <td>-</td>
628
+ </tr>
629
+ <tr>
630
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
631
+ <td>8B</td>
632
+ <td>53.2</td>
633
+ <td>69.6*</td>
634
+ <td><strong>67.6*</strong></td>
635
+ <td>63.3/69.0</td>
636
+ </tr>
637
+ <tr>
638
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
639
+ <td>8B</td>
640
+ <td>54.8</td>
641
+ <td>67.7</td>
642
+ <td>52.5</td>
643
+ <td>64.2/66.9</td>
644
+ </tr>
645
+ <tr>
646
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
647
+ <td>8B</td>
648
+ <td>53.0</td>
649
+ <td>69.1</td>
650
+ <td>53.8</td>
651
+ <td>60.9/63.6</td>
652
+ </tr>
653
+ <tr>
654
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
655
+ <td>8B</td>
656
+ <td><u>56.7</u></td>
657
+ <td><u>71.9</u></td>
658
+ <td><u>58.6</u></td>
659
+ <td>63.9/67.9</td>
660
+ </tr>
661
+ </tbody>
662
+ </table>
663
 
664
+ </div>
665
+ * We evaluate officially released checkpoints by ourselves.
666
 
 
 
 
 
 
 
667
  </details>
668
 
669
 
 
670
  <details>
671
+ <summary>Click to view audio understanding and speech conversation results.</summary>
672
 
673
+ **Audio Understanding**
 
 
 
674
 
675
+ <div align="center">
676
+ <table style="margin: 0px auto;">
677
+ <thead>
678
+ <tr>
679
+ <th align="left">Task</th>
680
+ <th>Size</th>
681
+ <th colspan="3">ASR (zh)</th>
682
+ <th colspan="3">ASR (en)</th>
683
+ <th colspan="2">AST</th>
684
+ <th>Emotion</th>
685
+ </tr>
686
+ <tr>
687
+ <th align="left">Metric</th>
688
+ <td></td>
689
+ <th colspan="3">CER↓</th>
690
+ <th colspan="3">WER↓</th>
691
+ <th colspan="2">BLEU↑</th>
692
+ <th>ACC↑</th>
693
+ </tr>
694
+ <tr>
695
+ <th align="left">Dataset</th>
696
+ <td></td>
697
+ <th>AISHELL-1</th>
698
+ <th>Fleurs zh</th>
699
+ <th>WenetSpeech test-net</th>
700
+ <th>LibriSpeech test-clean</th>
701
+ <th>GigaSpeech</th>
702
+ <th>TED-LIUM</th>
703
+ <th>CoVoST en2zh</th>
704
+ <th>CoVoST zh2en</th>
705
+ <th>MELD emotion</th>
706
+ </tr>
707
+ </thead>
708
+ <tbody align="center">
709
+ <tr>
710
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
711
+ </tr>
712
+ <tr>
713
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
714
+ <td>-</td>
715
+ <td>7.3*</td>
716
+ <td><u>5.4*</u></td>
717
+ <td>28.9*</td>
718
+ <td>2.6*</td>
719
+ <td>12.9*</td>
720
+ <td>4.8*</td>
721
+ <td>37.1*</td>
722
+ <td>15.7*</td>
723
+ <td>33.2*</td>
724
+ </tr>
725
+ <tr>
726
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
727
+ <td>-</td>
728
+ <td>4.5*</td>
729
+ <td>5.9*</td>
730
+ <td>14.3*</td>
731
+ <td>2.9*</td>
732
+ <td>10.6*</td>
733
+ <td><strong>3.0*</strong></td>
734
+ <td><u>47.3*</u></td>
735
+ <td>22.6*</td>
736
+ <td>48.4*</td>
737
+ </tr>
738
+ <tr>
739
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
740
+ </tr>
741
+ <tr>
742
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
743
+ <td>8B</td>
744
+ <td>-</td>
745
+ <td>7.5</td>
746
+ <td>-</td>
747
+ <td><strong>1.6</strong></td>
748
+ <td>-</td>
749
+ <td>-</td>
750
+ <td>45.2</td>
751
+ <td><u>24.4</u></td>
752
+ <td><strong>55.3</strong></td>
753
+ </tr>
754
+ <tr>
755
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
756
+ <td>8B</td>
757
+ <td>2.6*</td>
758
+ <td>6.9*</td>
759
+ <td><u>10.3*</u></td>
760
+ <td>3.1*</td>
761
+ <td><u>9.7</u>*</td>
762
+ <td>5.9*</td>
763
+ <td>39.5*</td>
764
+ <td>22.9*</td>
765
+ <td>17.4*</td>
766
+ </tr>
767
+ <tr>
768
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
769
+ <td>8B</td>
770
+ <td>2.16</td>
771
+ <td>-</td>
772
+ <td>8.4</td>
773
+ <td>3.4</td>
774
+ <td>-</td>
775
+ <td>-</td>
776
+ <td>-</td>
777
+ <td>-</td>
778
+ <td>-</td>
779
+ </tr>
780
+ <tr>
781
+ <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
782
+ <td>9B</td>
783
+ <td><u>2.5</u></td>
784
+ <td>-</td>
785
+ <td>-</td>
786
+ <td>2.8</td>
787
+ <td>-</td>
788
+ <td>-</td>
789
+ <td>-</td>
790
+ <td>-</td>
+ <td>-</td>
791
+ </tr>
792
+ <tr>
793
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
794
+ <td>8B</td>
795
+ <td><strong>1.6</strong></td>
796
+ <td><strong>4.4</strong></td>
797
+ <td><strong>6.9</strong></td>
798
+ <td><u>1.7</u></td>
799
+ <td><strong>8.7</strong></td>
800
+ <td><strong>3.0</strong></td>
801
+ <td><strong>48.2</strong></td>
802
+ <td><strong>27.2</strong></td>
803
+ <td><u>52.4</u></td>
804
+ </tr>
805
+ </tbody>
806
+ </table>
807
+ </div>
808
+ * We evaluate officially released checkpoints by ourselves.<br><br>
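The WER/CER metrics in the table above are standard edit-distance error rates. A minimal reference implementation of WER follows (CER is the same computation over characters instead of words):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```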
809
+
810
+ **Speech Generation**
811
+
812
+ <div align="center">
813
+ <table style="margin: 0px auto;">
814
+ <thead>
815
+ <tr>
816
+ <th align="left">Task</th>
817
+ <th>Size</th>
818
+ <th colspan="9">SpeechQA</th>
819
+ </tr>
820
+ <tr>
821
+ <th align="left">Metric</th>
822
+ <th></th>
823
+ <th colspan="3">ACC↑</th>
824
+ <th>G-Eval (10 point)↑</th>
825
+ <th>Semantic ELO score↑</th>
826
+ <th>Acoustic ELO score↑</th>
827
+ <th>Overall ELO score↑</th>
828
+ <th>UTMOS↑</th>
829
+ <th>ASR-WER↓</th>
830
+ </tr>
831
+ <tr>
832
+ <th align="left">Dataset</th>
833
+ <th></th>
834
+ <th>Speech Llama Q.</th>
835
+ <th>Speech Web Q.</th>
836
+ <th>Speech Trivia QA</th>
837
+ <th>Speech AlpacaEval</th>
838
+ <th colspan="5">AudioArena</th>
839
+ </tr>
840
+ </thead>
841
+ <tbody align="center">
842
+ <tr>
843
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
844
+ </tr>
845
+ <tr>
846
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
847
+ <td></td>
848
+ <td><strong>71.7</strong></td>
849
+ <td><strong>51.6</strong></td>
850
+ <td><strong>69.7</strong></td>
851
+ <td><strong>7.4</strong></td>
852
+ <td><strong>1157</strong></td>
853
+ <td><strong>1203</strong></td>
854
+ <td><strong>1200</strong></td>
855
+ <td><strong>4.2</strong></td>
856
+ <td><strong>2.3</strong></td>
857
+ </tr>
858
+ <tr>
859
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
860
+ </tr>
861
+ <tr>
862
+ <td nowrap="nowrap" align="left">GLM-4-Voice</td>
863
+ <td>9B</td>
864
+ <td>50.0</td>
865
+ <td>32.0</td>
866
+ <td>36.4</td>
867
+ <td><u>5.1</u></td>
868
+ <td>999</td>
869
+ <td>1147</td>
870
+ <td>1035</td>
871
+ <td><u>4.1</u></td>
872
+ <td><u>11.7</u></td>
873
+ </tr>
874
+ <tr>
875
+ <td nowrap="nowrap" align="left">Llama-Omni</td>
876
+ <td>8B</td>
877
+ <td>45.3</td>
878
+ <td>22.9</td>
879
+ <td>10.7</td>
880
+ <td>3.9</td>
881
+ <td>960</td>
882
+ <td>878</td>
883
+ <td>897</td>
884
+ <td>3.2</td>
885
+ <td>24.3</td>
886
+ </tr>
887
+ <tr>
888
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
889
+ <td>8B</td>
890
+ <td>46.7</td>
891
+ <td>28.1</td>
892
+ <td>23.3</td>
893
+ <td>2.0</td>
894
+ <td>-</td>
895
+ <td>-</td>
896
+ <td>-</td>
897
+ <td>-</td>
898
+ <td>-</td>
899
+ </tr>
900
+ <tr>
901
+ <td nowrap="nowrap" align="left">Moshi</td>
902
+ <td>7B</td>
903
+ <td>43.7</td>
904
+ <td>23.8</td>
905
+ <td>16.7</td>
906
+ <td>2.4</td>
907
+ <td>871</td>
908
+ <td>808</td>
909
+ <td>875</td>
910
+ <td>2.8</td>
911
+ <td>8.2</td>
912
+ </tr>
913
+ <tr>
914
+ <td nowrap="nowrap" align="left">Mini-Omni</td>
915
+ <td>1B</td>
916
+ <td>22.0</td>
917
+ <td>12.8</td>
918
+ <td>6.9</td>
919
+ <td>2.5</td>
920
+ <td>926</td>
921
+ <td>803</td>
922
+ <td>865</td>
923
+ <td>3.4</td>
924
+ <td>10.0</td>
925
+ </tr>
926
+ <tr>
927
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
928
+ <td>8B</td>
929
+ <td><u>61.0</u></td>
930
+ <td><u>40.0</u></td>
931
+ <td><u>40.2</u></td>
932
+ <td><u>5.1</u></td>
933
+ <td><u>1088</u></td>
934
+ <td><u>1163</u></td>
935
+ <td><u>1131</u></td>
936
+ <td><strong>4.2</strong></td>
937
+ <td>9.8</td>
938
+ </tr>
939
+ </tbody>
940
+ </table>
941
+ </div>
942
+ All results are from <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>; the evaluation methods and further details can be found there.<br><br>
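The AudioArena ELO scores above come from pairwise human preference votes. Ratings of this kind are typically maintained with the standard Elo update; the sketch below is generic, and the K-factor and starting rating of 1000 are assumptions, not AudioArena's exact configuration.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise comparison.
    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; A wins one comparison and gains k/2 = 16 points.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
```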
943
+
944
+ **End-to-end Voice Cloning**
945
+
946
+ <div align="center">
947
+ <table style="margin: 0px auto;">
948
+ <thead>
949
+ <tr>
950
+ <th align="left">Task</th>
951
+ <th colspan="2">Voice cloning</th>
952
+ </tr>
953
+ <tr>
954
+ <th align="left">Metric</th>
955
+ <th>SIMO↑</th>
956
+ <th>SIMO↑</th>
957
+ </tr>
958
+ <tr>
959
+ <th align="left">Dataset</th>
960
+ <th>Seed-TTS test-zh</th>
961
+ <th>Seed-TTS test-en</th>
962
+ </tr>
963
+ </thead>
964
+ <tbody align="center">
965
+ <tr>
966
+ <td nowrap="nowrap" align="left">F5-TTS</td>
967
+ <td><strong>76</strong></td>
968
+ <td><strong>67</strong></td>
969
+ </tr>
970
+ <tr>
971
+ <td nowrap="nowrap" align="left">CosyVoice</td>
972
+ <td><u>75</u></td>
973
+ <td><u>64</u></td>
974
+ </tr>
975
+ <tr>
976
+ <td nowrap="nowrap" align="left">FireRedTTS</td>
977
+ <td>63</td>
978
+ <td>46</td>
979
+ </tr>
980
+ <tr>
981
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
982
+ <td>57</td>
983
+ <td>47</td>
984
+ </tr>
985
+ </tbody>
986
+ </table>
987
+ </div>
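SIMO in the table above measures speaker similarity between cloned and reference speech, conventionally scored as the cosine similarity of speaker embeddings. Below is a minimal sketch with toy vectors standing in for a real speaker encoder; the 0-100 scaling is an assumption for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Toy 3-dim "speaker embeddings"; real systems use a pretrained speaker encoder.
ref_embedding = [0.6, 0.8, 0.0]
gen_embedding = [0.6, 0.8, 0.0]
simo = 100 * cosine_similarity(ref_embedding, gen_embedding)  # scaled to 0-100
```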
988
+
989
+ </details>
990
+
991
+ <details>
992
+ <summary>Click to view multimodal live streaming results.</summary>
993
+
994
+ **Multimodal Live Streaming**: results on StreamingBench
995
+
996
+ <table style="margin: 0px auto;">
997
+ <thead>
998
+ <tr>
999
+ <th align="left">Model</th>
1000
+ <th>Size</th>
1001
+ <th>Real-Time Video Understanding</th>
1002
+ <th>Omni-Source Understanding</th>
1003
+ <th>Contextual Understanding</th>
1004
+ <th>Overall</th>
1005
+ </tr>
1006
+ </thead>
1007
+ <tbody align="center">
1008
+ <tr>
1009
+ <td colspan="7" align="left"><strong>Proprietary</strong></td>
1010
+ </tr>
1011
+ <tr>
1012
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
1013
+ <td>-</td>
1014
+ <td><u>77.4</u></td>
1015
+ <td><strong>67.8</strong></td>
1016
+ <td><strong>51.1</strong></td>
1017
+ <td><strong>70.3</strong></td>
1018
+ </tr>
1019
+ <tr>
1020
+ <td nowrap="nowrap" align="left">GPT-4o-202408</td>
1021
+ <td>-</td>
1022
+ <td>74.5</td>
1023
+ <td>51.0</td>
1024
+ <td><u>48.0</u></td>
1025
+ <td>64.1</td>
1026
+ </tr>
1027
+ <tr>
1028
+ <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
1029
+ <td>-</td>
1030
+ <td>74.0</td>
1031
+ <td>41.4</td>
1032
+ <td>37.8</td>
1033
+ <td>59.7</td>
1034
+ </tr>
1035
+ <tr>
1036
+ <td colspan="9" align="left"><strong>Open-source</strong></td>
1037
+ </tr>
1038
+ <tr>
1039
+ <td nowrap="nowrap" align="left">VILA-1.5</td>
1040
+ <td>8B</td>
1041
+ <td>61.5</td>
1042
+ <td>37.5</td>
1043
+ <td>26.7</td>
1044
+ <td>49.5</td>
1045
+ </tr>
1046
+ <tr>
1047
+ <td nowrap="nowrap" align="left">LongVA</td>
1048
+ <td>7B</td>
1049
+ <td>63.1</td>
1050
+ <td>35.9</td>
1051
+ <td>30.2</td>
1052
+ <td>50.7</td>
1053
+ </tr>
1054
+ <tr>
1055
+ <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
1056
+ <td>34B</td>
1057
+ <td>69.8</td>
1058
+ <td>41.7</td>
1059
+ <td>34.3</td>
1060
+ <td>56.7</td>
1061
+ </tr>
1062
+ <tr>
1063
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
1064
+ <td>8B</td>
1065
+ <td>71.2</td>
1066
+ <td>40.7</td>
1067
+ <td>33.1</td>
1068
+ <td>57.0</td>
1069
+ </tr>
1070
+ <tr>
1071
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
1072
+ <td>8B</td>
1073
+ <td>70.1</td>
1074
+ <td>42.7</td>
1075
+ <td>34.1</td>
1076
+ <td>57.0</td>
1077
+ </tr>
1078
+ <tr>
1079
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
1080
+ <td>8B</td>
1081
+ <td>70.9</td>
1082
+ <td>40.8</td>
1083
+ <td>35.8</td>
1084
+ <td>57.4</td>
1085
+ </tr>
1086
+ <tr>
1087
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
1088
+ <td>8B</td>
1089
+ <td>74.3</td>
1090
+ <td>40.8</td>
1091
+ <td>31.0</td>
1092
+ <td>58.4</td>
1093
+ </tr>
1094
+ <tr>
1095
+ <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
1096
+ <td>8B</td>
1097
+ <td>75.4</td>
1098
+ <td>46.2</td>
1099
+ <td>33.6</td>
1100
+ <td>60.8</td>
1101
+ </tr>
1102
+ <tr>
1103
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
1104
+ <td>8B</td>
1105
+ <td>72.4</td>
1106
+ <td>40.2</td>
1107
+ <td>33.4</td>
1108
+ <td>57.7</td>
1109
+ </tr>
1110
+ <tr>
1111
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
1112
+ <td>8B</td>
1113
+ <td><strong>79.9</strong></td>
1114
+ <td><u>53.4</u></td>
1115
+ <td>38.5</td>
1116
+ <td><u>66.0</u></td>
1117
+ </tr>
1118
+ </tbody>
1119
+ </table>
1120
+
1121
+ </details>
1122
+
1123
+
1124
+ ### Examples <!-- omit in toc -->
1125
+
1126
+ We deploy MiniCPM-o 2.6 on end devices. The demo video is an unedited, real-speed recording on an iPad Pro and a web demo.
1127
+
1128
+ <div align="center">
1129
+ <a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
1130
+ </div>
1131
+
1132
+ <br>
1133
+
1134
+ <div style="display: flex; flex-direction: column; align-items: center;">
1135
+ <img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
1136
+ <img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
1137
+ <img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
1138
+ </div>
1139
+
1140
+
1141
+ ## Legacy Models <!-- omit in toc -->
1142
+
1143
+ | Model | Introduction and Guidance |
1144
+ |:-----------|:-------------------:|
1145
+ | MiniCPM-V 4.0 | [Document](./docs/minicpm_v4_en.md) |
1146
+ | MiniCPM-V 2.6 | [Document](./docs/minicpm_v2dot6_en.md) |
1147
+ | MiniCPM-Llama3-V 2.5 | [Document](./docs/minicpm_llama3_v2dot5.md) |
1148
+ | MiniCPM-V 2.0 | [Document](./docs/minicpm_v2.md) |
1149
+ | MiniCPM-V 1.0 | [Document](./docs/minicpm_v1.md) |
1150
+ | OmniLMM-12B | [Document](././docs/omnilmm_en.md) |
1151
+
1152
+
1153
+ ## MiniCPM-V & o Cookbook
1154
+
1155
+ Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
1156
+
1157
+ **Easy Usage Documentation**
1158
+
1159
+ Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
1160
+ All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
1161
+
1162
+ **Broad User Spectrum**
1163
+
1164
+ We support a wide range of users, from individuals to enterprises and researchers.
1165
+
1166
+ * **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
1167
+ * **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
1168
+ * **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
1169
+
1170
+ **Versatile Deployment Scenarios**
1171
+
1172
+ Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment requirements.
1173
+
1174
+ * **Web demo**: Launch an interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
1175
+ * **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
1176
+ * **End devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
1177
+
1178
+
1179
+ ## Chat with Our Demo on Gradio 🤗
1180
+
1181
+ We provide online and local demos powered by Hugging Face Gradio <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>, one of the most popular model deployment frameworks. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
1182
+
1183
+
1184
+ ### Online Demo <!-- omit in toc -->
1185
+
1186
+ Click here to try out the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn/) | [MiniCPM-V 2.6](http://120.92.209.146:8887/) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).
1187
+
1188
+ ### Local WebUI Demo <!-- omit in toc -->
1189
+
1190
+ You can easily build your own local WebUI demo using the following commands.
1191
+
1192
+ Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues.
1193
+
1194
+ If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, add `self.minicpmo_model.tts.float()` during model initialization.
1195
+
1196
+ **For real-time voice/video call demo:**
1197
+ 1. Launch the model server:
1198
+ ```shell
1199
+ pip install -r requirements_o2.6.txt
1200
+
1201
+ python web_demos/minicpm-o_2.6/model_server.py
1202
+ ```
1203
+
1204
+ 2. Launch the web server:
1205
+
1206
+ ```shell
1207
+ # Make sure Node.js and pnpm are installed.
1208
+ sudo apt-get update
1209
+ sudo apt-get install nodejs npm
1210
+ npm install -g pnpm
1211
+
1212
+
1213
+ cd web_demos/minicpm-o_2.6/web_server
1214
+ # Create an SSL certificate for HTTPS, which is required to request camera and microphone permissions.
1215
+ bash ./make_ssl_cert.sh # output key.pem and cert.pem
1216
+
1217
+ pnpm install # install requirements
1218
+ pnpm run dev # start server
1219
+ ```
1220
+ Open `https://localhost:8088/` in your browser and enjoy the real-time voice/video call.
1221
+
1222
+ **For chatbot demo:**
1223
+ ```shell
1224
+ pip install -r requirements_o2.6.txt
1225
+
1226
+ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
1227
+ ```
1228
+ Open `http://localhost:8000/` in your browser and enjoy the vision-mode chatbot.
1229
+
1230
+ ## Inference
1231
+
1232
+
1233
+ ### Model Zoo
1234
+
1235
+ | Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
1236
+ |:-----------|:--:|:-----------:|:-------------------|:---------------:|
1237
+ | MiniCPM-V 4.5| GPU | 18 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5) |
1238
+ | MiniCPM-V 4.5 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf) |
1239
+ | MiniCPM-V 4.5 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4) |
1240
+ | MiniCPM-V 4.5 AWQ | GPU | 9 GB | The AWQ quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-AWQ) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-AWQ) |
1241
+ | MiniCPM-o 2.6| GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
1242
+ | MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
1243
+ | MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
1244
+
1245
+ ### Multi-turn Conversation
1246
+
1247
+ If you wish to enable long-thinking mode, provide the argument `enable_thinking=True` to the chat function.
1248
+
1249
+ ```shell
1250
+ pip install -r requirements_o2.6.txt
1251
+ ```
1252
+
1253
+ Please refer to the following code to run the model.
1254
+
1255
+ <div align="center">
1256
+ <img src="assets/minicpmo2_6/show_demo.jpg" width="500px">
1257
+ </div>
1258
+
1259
+
1260
+ ```python
1261
+ import torch
1262
+ from PIL import Image
1263
+ from transformers import AutoModel, AutoTokenizer
1264
+
1265
+ torch.manual_seed(100)
1266
+
1267
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
1268
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
1269
+ model = model.eval().cuda()
1270
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
1271
+
1272
+ image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
1273
+
1274
+ enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled.
1275
+
1276
+ # First round chat
1277
+ question = "What is the landform in the picture?"
1278
+ msgs = [{'role': 'user', 'content': [image, question]}]
1279
+
1280
+ answer = model.chat(
1281
+ msgs=msgs,
1282
+ tokenizer=tokenizer,
1283
+ enable_thinking=enable_thinking
1284
+ )
1285
+ print(answer)
1286
+
1287
+ # Second round chat, pass history context of multi-turn conversation
1288
+ msgs.append({"role": "assistant", "content": [answer]})
1289
+ msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
1290
+
1291
+ answer = model.chat(
1292
+ msgs=msgs,
1293
+ tokenizer=tokenizer
1294
+ )
1295
+ print(answer)
1296
+ ```
1297
+
1298
+ You will get the following output:
1299
+
1300
+ ```shell
1301
+ # round1
1302
+ The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.
1303
+
1304
+ This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.
1305
+
1306
+ # round2
1307
+ When traveling to a karst landscape like this, here are some important tips:
1308
+
1309
+ 1. Wear comfortable shoes: The terrain can be uneven and hilly.
1310
+ 2. Bring water and snacks for energy during hikes or boat rides.
1311
+ 3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
1312
+ 4. Respect local customs and nature regulations by not littering or disturbing wildlife.
1313
+
1314
+ By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
1315
+ ```
1316
+
1317
#### Chat with Multiple Images
<details>
<summary> Click to view Python code running MiniCPM-V-4_5 with multiple images input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### In-context Few-shot Learning
<details>
<summary> Click to view Python code running MiniCPM-V-4_5 with few-shot input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### Chat with Video
<details>
<summary> Click to view Python code running MiniCPM-V-4_5 with video input and the 3D-Resampler. </summary>

```python
# The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
# To achieve this, you need to organize your video data into two corresponding sequences:
#   frames: List[Image]
#   temporal_ids: List[List[Int]]

import math

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord
from scipy.spatial import cKDTree

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES = 180  # maximum number of frames after packing; the maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING
MAX_NUM_PACKING = 3   # maximum packing number of video frames; valid range: 1-6
TIME_SCALE = 0.1

def map_to_nearest_scale(values, scale):
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]


def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps

    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]
    frame_idx = np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)

    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')

    frames = vr.get_batch(frame_idx).asnumpy()

    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)

    return frames, frame_ts_id_group


video_path = "video_test.mp4"
fps = 5  # sampling fps for the video
force_packing = None  # set force_packing to force 3D packing; otherwise encode_video sets the packing number dynamically based on the duration
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
```
</details>


#### Speech and Audio Mode

Model initialization:

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

<hr/>

##### Mimick <!-- omit in toc -->

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes an audio input, outputs an ASR transcription, and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True)  # load the audio to be mimicked

# Try also:
# `./assets/input_examples/fast-pace.wav`,
# `./assets/input_examples/chi-english-1.wav`,
# `./assets/input_examples/exciting-emotion.wav`
# for different aspects of speech-centric features.

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav',  # save the TTS result to output_audio_path
)
```

<hr/>

##### General Speech Conversation with Configurable Voices <!-- omit in toc -->

A common usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# round two: append the previous response and the new question to the history
# (note: list.append mutates in place and returns None, so do not assign its result)
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```

<hr/>

##### Speech Conversation as an AI Assistant <!-- omit in toc -->

An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, and it follows instructions more closely. For demos, we suggest using `assistant_female_voice`, `assistant_male_voice`, or `assistant_default_female_voice`. Other voices may work, but are not as stable as the default ones.

*Please note that `assistant_female_voice` and `assistant_male_voice` are more stable but sound robotic, while `assistant_default_female_voice` is more human-like but less stable; its voice often changes across multiple turns. We suggest trying the stable voices `assistant_female_voice` and `assistant_male_voice`.*

```python
ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True)  # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # load the user's audio question

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# round two: append the previous response and the new question to the history
# (note: list.append mutates in place and returns None, so do not assign its result)
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)
```

<hr/>

##### Instruction-to-Speech <!-- omit in toc -->

`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.

```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
```

<hr/>

##### Voice Cloning <!-- omit in toc -->

`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts as a TTS model.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)
```

<hr/>

##### Addressing Various Audio Understanding Tasks <!-- omit in toc -->

`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。`
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"  # can be changed to the other prompts
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True)  # load the audio to be transcribed

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
```
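For scripted use, the task prompts above can be collected into a small lookup table. This is only a convenience sketch; the key names (`asr_en`, `caption`, ...) are our own shorthand, not part of the model API:

```python
# A sketch: lookup table for the audio-to-text task prompts listed above.
# Key names are our own shorthand, not part of the MiniCPM-o API.
AUDIO_TASK_PROMPTS = {
    "asr_zh": "请仔细听这段音频片段,并将其内容逐字记录。",
    "asr_en": "Please listen to the audio snippet carefully and transcribe the content.",
    "speaker_analysis": "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    "caption": "Summarize the main content of the audio.",
    "scene_tagging": "Utilize one keyword to convey the audio's content or the associated scene.",
}

def build_task_prompt(task: str) -> str:
    """Return the prompt for an audio-to-text task, with the trailing newline used above."""
    return AUDIO_TASK_PROMPTS[task] + "\n"
```

`build_task_prompt('asr_en')` can then replace the hard-coded `task_prompt` string in the snippet above.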


#### Multimodal Live Streaming
<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with chat inference. </summary>

```python
import math
import tempfile

import numpy as np
import torch
import librosa
import soundfile as sf
from PIL import Image
from moviepy.editor import VideoFileClip
from transformers import AutoModel, AutoTokenizer

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk per unit
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray(frame.astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents


model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()

# With an older version of PyTorch you might encounter the error
# '"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16''; convert the TTS module to float32 in that case.
# model.tts.float()

# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
video_path = "assets/Skiing.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# To use a voice-clone prompt, set ref_audio:
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,  # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
</details>

<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with streaming inference. </summary>

Note: streaming inference has a slight performance degradation because the audio encoding is not global.
```python
# A new conversation needs to reset the session first; this clears the KV cache.
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill the system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt

    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```

</details>

### Inference on Multiple GPUs
You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this [tutorial](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for detailed instructions on how to load the model and run inference with multiple low-VRAM GPUs.
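As a rough sketch of what such a split looks like: Transformers accepts a `device_map` dict mapping module names to GPU indices. The module names below (`llm.model.layers.N`, `vpm`, ...) are illustrative assumptions only; check the tutorial and the model's actual module tree before using them.

```python
# Sketch: build a transformers-style device_map that spreads decoder layers
# evenly across GPUs. Module names here are assumptions for illustration.
def build_device_map(num_layers: int, num_gpus: int) -> dict:
    """Spread decoder layers evenly across GPUs; pin vision/embedding modules."""
    device_map = {
        "vpm": 0,                        # vision encoder on GPU 0
        "resampler": 0,
        "llm.model.embed_tokens": 0,
        "llm.model.norm": num_gpus - 1,  # final norm and head on the last GPU
        "llm.lm_head": num_gpus - 1,
    }
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    for i in range(num_layers):
        device_map[f"llm.model.layers.{i}"] = min(i // per_gpu, num_gpus - 1)
    return device_map

dm = build_device_map(num_layers=32, num_gpus=2)
print(dm["llm.model.layers.0"], dm["llm.model.layers.31"])  # → 0 1
```

The resulting dict could then be passed as `device_map=...` to `AutoModel.from_pretrained`; the tutorial linked above is the authoritative reference for the exact module names.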


### Inference on Mac
<details>
<summary>Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>

```python
# test.py  (needs more than 16 GB of memory)
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)
```
Run with the command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```
</details>


### Efficient Inference with llama.cpp, Ollama, vLLM

See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro with M4).

See [our fork of Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro with M4).


<details>
<summary> vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0, and you can use our fork to run MiniCPM-o 2.6 for now. Click to see. </summary>

1. Install vLLM (>= 0.7.1):
```shell
pip install vllm
```

2. Run the examples:
* [Vision Language](https://docs.vllm.ai/en/latest/getting_started/examples/vision_language.html)
* [Audio Language](https://docs.vllm.ai/en/latest/getting_started/examples/audio_language.html)
</details>

## Fine-tuning

### Simple Fine-tuning <!-- omit in toc -->

We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.

[Reference Document](./finetune/readme.md)


### With Align-Anything <!-- omit in toc -->

We support fine-tuning MiniCPM-o 2.6 (both vision and audio, SFT and DPO) with the [Align-Anything framework](https://github.com/PKU-Alignment/align-anything) by the PKU-Alignment Team. Align-Anything is a scalable framework that aims to align any-modality large models with human intentions, open-sourcing the [datasets, models and benchmarks](https://huggingface.co/datasets/PKU-Alignment/align-anything). Benefiting from its concise and modular design, it supports 30+ open-source benchmarks and 40+ models and algorithms including SFT, SimPO, RLHF, *etc*. It also provides 30+ directly runnable scripts, making it suitable for beginners to get started quickly.

Best Practices: [MiniCPM-o 2.6](https://github.com/PKU-Alignment/align-anything/tree/main/scripts).


### With LLaMA-Factory <!-- omit in toc -->

We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (LoRA/Full/QLoRA) of 200+ LLMs without coding, through the built-in web UI LLaMABoard. It supports various training methods such as SFT/PPO/DPO/KTO and advanced algorithms such as GaLore/BAdam/LLaMA-Pro/PiSSA/LongLoRA.

Best Practices: [MiniCPM-o 2.6 | MiniCPM-V 2.6](./docs/llamafactory_train_and_infer.md).


### With the SWIFT Framework <!-- omit in toc -->

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapters library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).

## Awesome work using MiniCPM-V & MiniCPM-o
- [text-extract-api](https://github.com/CatchTheTornado/text-extract-api): Document extraction API using OCRs and Ollama-supported models ![GitHub Repo stars](https://img.shields.io/github/stars/CatchTheTornado/text-extract-api)
- [comfyui_LLM_party](https://github.com/heshengtao/comfyui_LLM_party): Build LLM workflows and integrate them into existing image workflows ![GitHub Repo stars](https://img.shields.io/github/stars/heshengtao/comfyui_LLM_party)
- [Ollama-OCR](https://github.com/imanoop7/Ollama-OCR): An OCR package that uses VLMs through Ollama to extract text from images and PDFs ![GitHub Repo stars](https://img.shields.io/github/stars/imanoop7/Ollama-OCR)
- [comfyui-mixlab-nodes](https://github.com/MixLabPro/comfyui-mixlab-nodes): A ComfyUI node suite supporting Workflow-to-APP, GPT & 3D, and more ![GitHub Repo stars](https://img.shields.io/github/stars/MixLabPro/comfyui-mixlab-nodes)
- [OpenAvatarChat](https://github.com/HumanAIGC-Engineering/OpenAvatarChat): An interactive digital-human conversation implementation running on a single PC ![GitHub Repo stars](https://img.shields.io/github/stars/HumanAIGC-Engineering/OpenAvatarChat)
- [pensieve](https://github.com/arkohut/pensieve): A privacy-focused passive recording project that records screen content ![GitHub Repo stars](https://img.shields.io/github/stars/arkohut/pensieve)
- [paperless-gpt](https://github.com/icereed/paperless-gpt): Use LLMs to handle paperless-ngx, with AI-powered titles, tags and OCR ![GitHub Repo stars](https://img.shields.io/github/stars/icereed/paperless-gpt)
- [Neuro](https://github.com/kimjammer/Neuro): A recreation of Neuro-Sama, running on local models on consumer hardware ![GitHub Repo stars](https://img.shields.io/github/stars/kimjammer/Neuro)

## FAQs
Click here to view the [FAQs](./docs/faqs.md)

## Limitations
As an experimental trial, we find that MiniCPM-o 2.6 has notable limitations worth further investigation and improvement.
- **Unstable speech output.** Speech generation can be flawed, with noisy backgrounds and meaningless sounds.
- **Repeated responses.** The model tends to repeat its response when encountering similar consecutive user queries.
- **High latency on the web demo.** Users may experience unusually high latency when using the web demo hosted on overseas servers. We recommend deploying the demo locally or using it with a good network connection.

## Model License <!-- omit in toc -->

* The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.

* To help us better understand and support our users, we would deeply appreciate it if you could consider optionally filling out a brief registration [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).

## Statement <!-- omit in toc -->

As MLLMs, the MiniCPM-o/V models generate content by learning from a large number of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by the MiniCPM-o/V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of the MiniCPM-o/V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or misapplication of the models.

## Institutions <!-- omit in toc -->

This project is developed by the following institutions:

- <img src="assets/thunlp.png" width="28px"> [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
- <img src="assets/modelbest.png" width="28px"> [ModelBest](https://modelbest.cn/)

## 🌟 Star History <!-- omit in toc -->

<table align="center">
<p align="center">
  <img src="assets/star-history-25-09-02.png"/>
</p>
</table>

<!-- <picture>
  <source
    media="(prefers-color-scheme: dark)"
    srcset="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date&theme=dark"
  />
  <source
    media="(prefers-color-scheme: light)"
    srcset="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date"
  />
  <img
    alt="Star History Chart"
    src="https://api.star-history.com/svg?repos=OpenBMB/MiniCPM-o&type=Date"
  />
</picture> -->

 
## Key Techniques and Other Multimodal Projects <!-- omit in toc -->

👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## Citation <!-- omit in toc -->

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

```bib
@misc{yu2025minicpmv45cookingefficient,