tc-mb committed on
Commit
218145d
·
verified ·
1 Parent(s): 77019fa

update readme

Files changed (1)
  1. README.md +65 -11
README.md CHANGED
@@ -4,6 +4,9 @@ pipeline_tag: image-text-to-text
4
  tags:
5
  - minicpm-v
6
  - multimodal
 
 
 
7
  ---
8
 
9
  A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone
@@ -11,6 +14,12 @@ A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Ph
11
  [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-V-4.6-Thinking-Demo) |
12
  [Feishu (Lark)](https://raw.githubusercontent.com/openbmb/MiniCPM-V/main/assets/feishu_qrcode.png)
13
 
14
  ## MiniCPM-V 4.6 Thinking
15
 
16
  **MiniCPM-V 4.6 Thinking** is the long chain-of-thought reasoning variant of [MiniCPM-V 4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6). It generates an explicit reasoning trace before producing the final answer, substantially boosting performance on complex multimodal reasoning, math, and OCR-heavy tasks, while keeping the same edge-friendly architecture (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the mixed 4x/16x visual token compression of MiniCPM-V 4.6.
@@ -198,9 +207,9 @@ You can customize image/video processing by passing additional parameters to `ap
198
  |-----------|---------|------------|-------------|
199
  | `downsample_mode` | `"16x"` | Image & Video | Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`. |
200
  | `max_slice_nums` | `9` | Image & Video | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail for large images. Recommended: `36` for image, `1` for video. |
201
- | `max_num_frames` | `128` | Video only | Maximum number of main frames sampled from the video. |
202
- | `stack_frames` | `1` | Video only | Total sample points per second. `1` = main frame only (no stacking). `N` (N>1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with main frames. Recommended: `3` or `5`. |
203
- | `use_image_id` | `True` | Image & Video | Whether to prepend `<image_id>N</image_id>` tags before each image/frame placeholder. Recommended: `True` for image, `False` for video. |
204
 
205
  > **Note:** `downsample_mode` must be passed to **both** `apply_chat_template` (for correct placeholder count) and `generate` (for the vision encoder). All other parameters only need to be passed to `apply_chat_template`.
206
 
@@ -235,6 +244,57 @@ curl -s http://localhost:8000/v1/chat/completions \
235
  }'
236
  ```
237
 
238
  #### Handling Escaped Newlines in Model Outputs <!-- omit in toc -->
239
 
240
  In some cases, the model might output escaped newline characters `\n` as string literals instead of actual newlines. To render the text correctly, especially in UI layers, you can use the following utility function. This function carefully replaces literal `\n` with real newlines while protecting scenarios where `\n` has specific semantic meaning.
@@ -269,6 +329,7 @@ def normalize_response_text(text: str) -> str:
269
 
270
  We have adapted MiniCPM-V 4.6 for deployment on **iOS, Android, and HarmonyOS** platforms, with **all edge adaptation code fully open-sourced**. Developers can reproduce the on-device experience in just a few steps. Visit our [edge deployment repository](https://github.com/OpenBMB/MiniCPM-V-edge-demo) for platform-specific build guides, or go to the [download page](https://github.com/OpenBMB/MiniCPM-V-edge-demo/blob/main/DOWNLOAD.md) to try pre-built apps directly.
271
 
 
272
  #### Use MiniCPM-V 4.6 in Other Inference and Training Frameworks <!-- omit in toc -->
273
 
274
  MiniCPM-V 4.6 supports multiple inference and training frameworks. Below are quick-start commands for each. For full details, see our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook).
@@ -407,7 +468,7 @@ swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>
407
 
408
  **Technical Reports:** [MiniCPM-o 4.5](https://huggingface.co/papers/2604.27393) | [MiniCPM-V 4.5](https://arxiv.org/abs/2509.18154) | [MiniCPM-o 2.6](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) | [MiniCPM-Llama3-V 2.5](https://arxiv.org/abs/2408.01800) | [MiniCPM-V 2.0](https://openbmb.vercel.app/minicpm-v-2)
409
 
410
- **Other Multimodal Projects:** [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
411
 
412
 
413
  ## Citation <!-- omit in toc -->
@@ -415,13 +476,6 @@ swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>
415
  If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
416
 
417
  ```bib
418
- @misc{cui2026minicpmo45realtimefullduplex,
419
- title={MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction},
420
- author={Junbo Cui and Bokai Xu and Chongyi Wang and Tianyu Yu and Weiyue Sun and Yingjing Xu and Tianran Wang and Zhihui He and Wenshuo Ma and Tianchi Cai and others},
421
- year={2026},
422
- url={https://arxiv.org/abs/2604.27393},
423
- }
424
-
425
  @proceedings{yu2025minicpmv45cookingefficient,
426
  title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
427
  author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},
 
4
  tags:
5
  - minicpm-v
6
  - multimodal
7
+ - On-Device Model
8
+ - lightweight
9
+ library_name: transformers
10
  ---
11
 
12
  A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone
 
14
  [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-V-4.6-Thinking-Demo) |
15
  [Feishu (Lark)](https://raw.githubusercontent.com/openbmb/MiniCPM-V/main/assets/feishu_qrcode.png)
16
 
17
+ ## News
18
+
19
+ * [2026.05.17] ⭐️⭐️⭐️ We have released the MiniCPM-V 4.6 API service, together with a **free public API key**! [Try it now](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/api.md).
20
+
21
+
22
+
23
  ## MiniCPM-V 4.6 Thinking
24
 
25
  **MiniCPM-V 4.6 Thinking** is the long chain-of-thought reasoning variant of [MiniCPM-V 4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6). It generates an explicit reasoning trace before producing the final answer, substantially boosting performance on complex multimodal reasoning, math, and OCR-heavy tasks, while keeping the same edge-friendly architecture (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the mixed 4x/16x visual token compression of MiniCPM-V 4.6.
 
207
  |-----------|---------|------------|-------------|
208
  | `downsample_mode` | `"16x"` | Image & Video | Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`. |
209
  | `max_slice_nums` | `9` | Image & Video | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail for large images. Recommended: `36` for image, `1` for video. |
210
+ | `max_num_frames` | `128` | Video only | Maximum number of main frames sampled from the video; it bounds the temporal context length and helps avoid VRAM overflow. <br> **Short videos** (duration ≤ `max_num_frames` seconds): the processor samples at **1 FPS**, capturing second-by-second detail without reaching the limit. <br> **Long videos** (duration > `max_num_frames` seconds): the processor switches to **uniform sampling**, selecting exactly `max_num_frames` frames evenly spaced across the entire timeline. |
211
+ | `stack_frames` | `1` | Video only | Total sample points per second. `1` = main frame only (no stacking). `N` (N>1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with main frames. Recommended: `1` for short videos; `3` or `5` for long videos. |
212
+ | `use_image_id` | `True` | Image & Video | Whether to prepend `<image_id>N</image_id>` tags before each image/frame placeholder. Set `True` for image, `False` for video. |
213
 
214
  > **Note:** `downsample_mode` must be passed to **both** `apply_chat_template` (for correct placeholder count) and `generate` (for the vision encoder). All other parameters only need to be passed to `apply_chat_template`.
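
The `max_num_frames` behaviour described in the table can be illustrated with a small sketch. This is not the processor's actual implementation: the function name `sample_frame_timestamps` and the choice to start sampling at `t = 0` are assumptions made only to show the two regimes (1 FPS for short videos, uniform sampling for long ones).

```python
import math


def sample_frame_timestamps(duration_sec: float, max_num_frames: int = 128) -> list[float]:
    """Hypothetical helper: timestamps (in seconds) of the main frames to sample."""
    if duration_sec <= 0 or max_num_frames <= 0:
        return []
    if duration_sec <= max_num_frames:
        # Short video: one main frame per second (1 FPS), starting at t = 0.
        return [float(t) for t in range(math.ceil(duration_sec))]
    # Long video: exactly max_num_frames timestamps, evenly spaced over the timeline.
    step = duration_sec / max_num_frames
    return [i * step for i in range(max_num_frames)]


short = sample_frame_timestamps(10)        # 10 s clip, sampled at 1 FPS
long = sample_frame_timestamps(600, 128)   # 10 min clip, capped at 128 frames
print(len(short), len(long))
```

Under these assumptions, a 10-second clip yields 10 main frames, while a 10-minute clip is capped at 128 frames spaced about 4.7 seconds apart.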
215
 
 
244
  }'
245
  ```
246
 
247
+ Tool calling example:
248
+
249
+ ```bash
250
+ curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
251
+ "model": "openbmb/MiniCPM-V-4.6-Thinking",
252
+ "messages": [{"role": "user", "content": [
253
+ {"type": "text", "text": "the weather of Beijing"}
254
+ ]}],
255
+ "tools": [{
256
+ "type": "function",
257
+ "function": {
258
+ "name": "get_weather",
259
+ "description": "Get the current weather for a given location",
260
+ "parameters": {
261
+ "type": "object",
262
+ "properties": {
263
+ "location": {"type": "string", "description": "City name"}
264
+ },
265
+ "required": ["location"]
266
+ }
267
+ }
268
+ }]
269
+ }'
270
+ ```
271
+
272
+ The model returns a natural-language explanation followed by a structured `<tool_call>` block embedded in the `content` field. A dedicated tool-call parser for this format has not yet been added to the transformers library, so for now tool calls must be extracted manually, for example with a regex.
273
+
274
+ ```json
275
+ {
276
+ "id": "f4f09c7d-8045-4cb1-ade9-07aa5dee637d",
277
+ "choices": [
278
+ {
279
+ "finish_reason": "stop",
280
+ "index": 0,
281
+ "message": {
282
+ "content": "I need to check the current weather for Beijing, so I will call the get_weather function.\n\n<tool_call>\n<function=get_weather>\n<parameter=location>\nBeijing\n</parameter>\n</function>\n</tool_call>",
283
+ "role": "assistant"
284
+ }
285
+ }
286
+ ],
287
+ "created": 1778748859,
288
+ "model": "openbmb/MiniCPM-V-4.6-Thinking@main",
289
+ "object": "chat.completion",
290
+ "usage": {
291
+ "completion_tokens": 47,
292
+ "prompt_tokens": 283,
293
+ "total_tokens": 330
294
+ }
295
+ }
296
+ ```
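
A minimal sketch of the manual extraction, assuming the `<tool_call>` / `<function=...>` / `<parameter=...>` layout seen in the sample response. The helper name `extract_tool_calls` and the returned dictionary shape are illustrative choices, not part of any library, and every parameter value is returned as a plain string.

```python
import re

# Patterns for the tag layout seen in the sample response; other layouts are not handled.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def extract_tool_calls(content: str) -> list[dict]:
    """Hypothetical helper: collect every tool call found in a message's content."""
    calls = []
    for block in TOOL_CALL_RE.findall(content):
        for name, body in FUNCTION_RE.findall(block):
            # Map each parameter name to its value, with surrounding whitespace stripped.
            arguments = dict(PARAM_RE.findall(body))
            calls.append({"name": name, "arguments": arguments})
    return calls


content = (
    "I need to check the current weather for Beijing, so I will call the get_weather function.\n\n"
    "<tool_call>\n<function=get_weather>\n<parameter=location>\nBeijing\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(extract_tool_calls(content))
# [{'name': 'get_weather', 'arguments': {'location': 'Beijing'}}]
```

If a tool expects non-string arguments, the caller would need to convert the values according to the tool's JSON schema.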
297
+
298
  #### Handling Escaped Newlines in Model Outputs <!-- omit in toc -->
299
 
300
  In some cases, the model might output escaped newline characters `\n` as string literals instead of actual newlines. To render the text correctly, especially in UI layers, you can use the following utility function. This function carefully replaces literal `\n` with real newlines while protecting scenarios where `\n` has specific semantic meaning.
 
329
 
330
  We have adapted MiniCPM-V 4.6 for deployment on **iOS, Android, and HarmonyOS** platforms, with **all edge adaptation code fully open-sourced**. Developers can reproduce the on-device experience in just a few steps. Visit our [edge deployment repository](https://github.com/OpenBMB/MiniCPM-V-edge-demo) for platform-specific build guides, or go to the [download page](https://github.com/OpenBMB/MiniCPM-V-edge-demo/blob/main/DOWNLOAD.md) to try pre-built apps directly.
331
 
332
+ <a id="inference-and-training"></a>
333
  #### Use MiniCPM-V 4.6 in Other Inference and Training Frameworks <!-- omit in toc -->
334
 
335
  MiniCPM-V 4.6 supports multiple inference and training frameworks. Below are quick-start commands for each. For full details, see our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook).
 
468
 
469
  **Technical Reports:** [MiniCPM-o 4.5](https://huggingface.co/papers/2604.27393) | [MiniCPM-V 4.5](https://arxiv.org/abs/2509.18154) | [MiniCPM-o 2.6](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) | [MiniCPM-Llama3-V 2.5](https://arxiv.org/abs/2408.01800) | [MiniCPM-V 2.0](https://openbmb.vercel.app/minicpm-v-2)
470
 
471
+ **Other Multimodal Projects:** [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) | [LLaVA-UHD-v4](https://arxiv.org/abs/2605.08985)
472
 
473
 
474
  ## Citation <!-- omit in toc -->
 
476
  If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
477
 
478
  ```bib
 
479
  @proceedings{yu2025minicpmv45cookingefficient,
480
  title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
481
  author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},