1f committed on
Commit
4b70ea9
verified · 1 Parent(s): ad1a50a

Add files using upload-large-folder tool

Files changed (20)
  1. r1-a/response_generation/minicpm/MiniCPM-o/assets/minicpmv-llama3-v2.5/temp +1 -0
  2. r1-a/response_generation/minicpm/MiniCPM-o/assets/wechat.png +0 -0
  3. r1-a/response_generation/minicpm/MiniCPM-o/assets/worldmap_ck.jpg +0 -0
  4. r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_register_model2.png +0 -0
  5. r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_search_box.png +0 -0
  6. r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_webui_button.png +0 -0
  7. r1-a/response_generation/minicpm/MiniCPM-o/assets/zhihu.webp +0 -0
  8. r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary.md +23 -0
  9. r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary_zh.md +22 -0
  10. r1-a/response_generation/minicpm/MiniCPM-o/docs/compare_with_phi-3_vision.md +27 -0
  11. r1-a/response_generation/minicpm/MiniCPM-o/docs/faqs.md +30 -0
  12. r1-a/response_generation/minicpm/MiniCPM-o/docs/inference_on_multiple_gpus.md +159 -0
  13. r1-a/response_generation/minicpm/MiniCPM-o/docs/llamafactory_train_and_infer.md +445 -0
  14. r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_llama3_v2dot5.md +333 -0
  15. r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v1.md +214 -0
  16. r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2.md +299 -0
  17. r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2dot6.md +945 -0
  18. r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm.md +183 -0
  19. r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm_en.md +155 -0
  20. r1-a/response_generation/minicpm/MiniCPM-o/docs/swift_train_and_infer.md +135 -0
r1-a/response_generation/minicpm/MiniCPM-o/assets/minicpmv-llama3-v2.5/temp ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/wechat.png ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/worldmap_ck.jpg ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_register_model2.png ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_search_box.png ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_webui_button.png ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/zhihu.webp ADDED
r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary.md ADDED
# MiniCPM-V Best Practices

**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video and text as inputs and provide high-quality text output, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6. The following sections provide detailed tutorials and guidance for each version of the MiniCPM-V models.

## MiniCPM-V 2.6

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its superior token density, MiniCPM-V 2.6 can, for the first time, support real-time video understanding on end-side devices such as the iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/GeHMwLMa0i2FhUkV0f6cz3HWnV1)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/YvsPwnPwWiqUjlkmW0scQ76TnBb)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/Kc7ywV4X1ipSaAkuPFOc9SFun8b)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/UpSiw63o9iGDhIklmwScX4a6nhW)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/Lwr9wpOQdinr6AkLzHrc9LlgnJD)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LTOKw3Hz7il9kGkCLX9czsennKe)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/Ug8iwdXfhiHVsDk2gGEco6xnnVg)
* [Model Structure](https://modelbest.feishu.cn/wiki/ACtAw9bOgiBQ9lkWyafcvtVEnQf)
r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary_zh.md ADDED
# MiniCPM-V Best Practices

**MiniCPM-V** is a series of end-side multimodal LLMs for vision-language understanding. The models take images and text as input and provide high-quality text output. Since February 2024, we have released five model versions, aiming for **leading performance and efficient deployment**. The most notable models in this series currently include:

## MiniCPM-V 2.6

The latest and best-performing model in the MiniCPM-V series. With a total of 8B parameters, it **surpasses GPT-4V** in single-image, multi-image and video understanding. In single-image understanding, it outperforms proprietary models such as **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet**, and further improves on MiniCPM-Llama3-V 2.5's OCR, trustworthy behavior, multilingual support and end-side deployment. Thanks to its leading visual token density, MiniCPM-V 2.6 is the first MLLM to support real-time video understanding on end-side devices such as the iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LZxLwp4Lzi29vXklYLFchwN5nCf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/HvfLwYzlIihqzXkmeCdczs6onmd)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/PAsHw6N6xiEy0DkJWpJcIocRnz9)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct, with a total of 8B parameters. Its performance improves significantly over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/O0KTwQV5piUPzTkRXl9cSFyHnQb)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/MPkPwvONEiZm3BkWMnyc83Tin4d)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/CZZJw1EDGitSSZka664cZwbWnrb)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/BcHIwjOLGihJXCkkSdMc2WhbnZf)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/L0ajwm8VAiiPY6kDZfJce3B7nRg)
* [Model Structure](https://modelbest.feishu.cn/wiki/X15nwGzqpioxlikbi2RcXDpJnjd)
r1-a/response_generation/minicpm/MiniCPM-o/docs/compare_with_phi-3_vision.md ADDED
## Phi-3-vision-128K-Instruct vs MiniCPM-Llama3-V 2.5

Comparison of Phi-3-vision-128K-Instruct and MiniCPM-Llama3-V 2.5 in terms of model size, hardware requirements, and performance.
With int4 quantization, MiniCPM-Llama3-V 2.5 delivers **smooth inference with only 8GB of GPU memory**. On most benchmarks, MiniCPM-Llama3-V 2.5 achieves **better performance** than Phi-3-vision-128K-Instruct. Moreover, MiniCPM-Llama3-V 2.5 also exhibits **lower latency and better throughput even without quantization**.

<div align="center">
<img src="../assets/phi3_vision_comparison.jpg" width="85%" />
</div>
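
The 8GB figure above can be reproduced by loading the int4-quantized checkpoint directly. This is a minimal sketch based on the published `openbmb/MiniCPM-Llama3-V-2_5-int4` weights; the local image path and generation parameters are illustrative assumptions:

```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# The int4 weights are already quantized; no extra quantization config is needed,
# and the model is placed on the GPU during loading.
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()

image = Image.open('example.jpg').convert('RGB')  # hypothetical local image
msgs = [{'role': 'user', 'content': 'Describe this image.'}]

res = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(res)
```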

### Multilingual Capabilities

MiniCPM-Llama3-V 2.5 exhibits **stronger multilingual capabilities** than Phi-3-vision-128K-Instruct on LLaVA Bench.

<div align="center">
<img src="../assets/llavabench_compare_phi3.png" width="100%" />
<br>
Evaluation results on multilingual LLaVA Bench
</div>
r1-a/response_generation/minicpm/MiniCPM-o/docs/faqs.md ADDED
### FAQs

<details>
<summary>Q: How should I choose between sampling and beam search for inference?</summary>

The quality of results from beam search and sampling decoding strategies can vary across scenarios. You can choose a decoding strategy based on the following aspects.

Consider sampling decoding if:

1. You require faster inference speed.
2. You want streaming generation.
3. Your task calls for open-ended responses.

If your task requires deterministic answers, you may want to experiment with beam search to see whether it achieves better outcomes.
</details>
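
As a concrete illustration, both strategies are selected through the `sampling` flag of `model.chat`, following the usage shown elsewhere in these docs. This is a sketch; treat the exact set of extra generation kwargs that get forwarded (e.g. `num_beams`) as an assumption that may vary across model versions:

```python
# Sampling decoding: faster, streamable, suited to open-ended responses.
res_sampling = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)

# Beam search: worth trying when the task has a single deterministic answer.
res_beam = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=False,   # assumed to fall back to beam search decoding
    num_beams=3,      # hypothetical kwarg forwarded to generate()
)
```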

<details>
<summary>Q: How do I ensure that the model generates results of sufficient length?</summary>

We've observed that during multilingual inference with MiniCPM-V 2.6, generation sometimes ends prematurely. You can improve the results by passing a `min_new_tokens` parameter.

```python
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    min_new_tokens=100
)
```
</details>
r1-a/response_generation/minicpm/MiniCPM-o/docs/inference_on_multiple_gpus.md ADDED
## Using MiniCPM-Llama3-V-2_5 with Multiple GPUs

Due to the limited memory capacity of a single GPU, it may be impossible to load the entire MiniCPM-Llama3-V-2_5 model (the model weights alone account for 18 GiB) onto one device for inference (for example, when a single GPU has only 12 GiB or 16 GiB of memory). To address this limitation, multi-GPU inference can be employed, where the model's layers are distributed across multiple GPUs.

This distribution can be achieved with minimal changes to the original model structure, assigning layers to different GPUs while leaving the inference script essentially untouched.

To implement this, we use the features provided by the `accelerate` library.

Install all requirements of MiniCPM-Llama3-V-2_5; in addition, you also need to install `accelerate`.

```bash
pip install accelerate
```

<br/>

### Example Usage for `2x16GiB` GPUs

We now consider a demo with two GPUs, where each GPU has 16 GiB of memory.

1. Import the necessary libraries.

```python
from PIL import Image
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model, dispatch_model
```

2. Download the model weights.

```python
MODEL_PATH = '/local/path/to/MiniCPM-Llama3-V-2_5' # you can download in advance or use `openbmb/MiniCPM-Llama3-V-2_5`
```

3. Determine the distribution of layers across the GPUs.

```python
# Maximum memory to use on each GPU. We suggest a conservative, balanced value:
# the weights are not everything, since intermediate activations also consume
# GPU memory (hence 10GiB < 16GiB).
max_memory_each_gpu = '10GiB'

# Which GPUs to use (here: two GPUs, each with 16GiB memory).
gpu_device_ids = [0, 1]

# Keep each decoder layer on a single device.
no_split_module_classes = ["LlamaDecoderLayer"]

max_memory = {
    device_id: max_memory_each_gpu for device_id in gpu_device_ids
}

config = AutoConfig.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

# Build the model structure without allocating real weight memory.
with init_empty_weights():
    model = AutoModel.from_config(
        config,
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory, no_split_module_classes=no_split_module_classes
)

print("auto determined device_map", device_map)

# Make sure the input and output layers are all on the first GPU,
# to avoid any modifications to the original inference script.
device_map["llm.model.embed_tokens"] = 0
device_map["llm.model.layers.0"] = 0
device_map["llm.lm_head"] = 0
device_map["vpm"] = 0
device_map["resampler"] = 0

print("modified device_map", device_map)
```

You may see output like this:

```
modified device_map OrderedDict([('llm.model.embed_tokens', 0), ('llm.model.layers.0', 0), ('llm.model.layers.1', 0), ('llm.model.layers.2', 0), ('llm.model.layers.3', 0), ('llm.model.layers.4', 0), ('llm.model.layers.5', 0), ('llm.model.layers.6', 0), ('llm.model.layers.7', 0), ('llm.model.layers.8', 0), ('llm.model.layers.9', 0), ('llm.model.layers.10', 0), ('llm.model.layers.11', 0), ('llm.model.layers.12', 0), ('llm.model.layers.13', 0), ('llm.model.layers.14', 0), ('llm.model.layers.15', 0), ('llm.model.layers.16', 1), ('llm.model.layers.17', 1), ('llm.model.layers.18', 1), ('llm.model.layers.19', 1), ('llm.model.layers.20', 1), ('llm.model.layers.21', 1), ('llm.model.layers.22', 1), ('llm.model.layers.23', 1), ('llm.model.layers.24', 1), ('llm.model.layers.25', 1), ('llm.model.layers.26', 1), ('llm.model.layers.27', 1), ('llm.model.layers.28', 1), ('llm.model.layers.29', 1), ('llm.model.layers.30', 1), ('llm.model.layers.31', 1), ('llm.model.norm', 1), ('llm.lm_head', 0), ('vpm', 0), ('resampler', 0)])
```

4. Next, use the `device_map` to dispatch the model layers to the corresponding GPUs.

```python
load_checkpoint_in_model(
    model,
    MODEL_PATH,
    device_map=device_map)

model = dispatch_model(
    model,
    device_map=device_map
)

torch.set_grad_enabled(False)

model.eval()
```

5. Chat!

```python
image_path = '/local/path/to/test.png'

response = model.chat(
    image=Image.open(image_path).convert("RGB"),
    msgs=[
        {
            "role": "user",
            "content": "guess what I am doing?"
        }
    ],
    tokenizer=tokenizer
)

print(response)
```

In this setup, the OOM (CUDA out of memory) problem should be eliminated. We have tested that:

- it works well for `3000` text input tokens and `1000` text output tokens.
- it works well for a high-resolution input image.

<br/>

### Usage for general cases

The procedure is the same as in the previous example, but you may need to modify these two variables.

```python
max_memory_each_gpu = '10GiB'  # leave headroom for intermediate activations, as discussed above

gpu_device_ids = [0, 1, ...]  # list the IDs of all GPUs you want to use
```

You can use the following shell command to monitor memory usage during inference. If you hit OOM, try reducing `max_memory_each_gpu` so that memory pressure is more balanced across all GPUs.

```bash
watch -n1 nvidia-smi
```
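
If you would rather monitor memory from inside the Python process, PyTorch's built-in counters give a rough per-GPU picture (a sketch; it only reports memory managed by PyTorch, so the numbers will be lower than what `nvidia-smi` shows):

```python
import torch

def print_gpu_memory():
    # allocated = live tensors; reserved = total footprint of the caching allocator
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"cuda:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

print_gpu_memory()  # e.g. call before and after model.chat() to gauge activation overhead
```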

<br/>

### References

[Ref 1](https://zhuanlan.zhihu.com/p/639850033)
r1-a/response_generation/minicpm/MiniCPM-o/docs/llamafactory_train_and_infer.md ADDED
# Best Practice with LLaMA-Factory

## Contents <!-- omit in toc -->

- [Supported Models](#supported-models)
- [LLaMA-Factory Installation](#llama-factory-installation)
- [Dataset Preparation](#dataset-preparation)
  - [Image Dataset](#image-dataset)
  - [Video Dataset](#video-dataset)
  - [Audio Dataset](#audio-dataset)
- [LoRA Fine-Tuning](#lora-fine-tuning)
- [Full-Parameter Fine-Tuning](#full-parameter-fine-tuning)
- [Inference](#inference)

## Supported Models
* [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
* [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)

## LLaMA-Factory Installation

You can install LLaMA-Factory with the commands below.

```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
mkdir configs # let's put all yaml files here
```

## Dataset Preparation

Refer to [data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) to register your customized dataset. Here we use the three existing demo datasets `mllm_demo`, `mllm_video_demo` and `mllm_audio_demo` as examples (audio is only for MiniCPM-o-2.6).
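
For a customized dataset, the registration entry in `data/dataset_info.json` mirrors the demo datasets. Below is a minimal sketch for a hypothetical image dataset named `my_mllm_dataset`, modeled on the `mllm_demo` entry; check the shipped file for the authoritative schema:

```json
"my_mllm_dataset": {
  "file_name": "my_mllm_data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
```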

### Image Dataset

Refer to the image SFT demo data: [data/mllm_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)

<details>
<summary>
<b>data/mllm_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  },
  {
    "messages": [
      {
        "content": "<image>Who is he?",
        "role": "user"
      },
      {
        "content": "He's Thomas Muller from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "Why is he on the ground?",
        "role": "user"
      },
      {
        "content": "Because he's sliding on his knees to celebrate.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/2.jpg"
    ]
  },
  {
    "messages": [
      {
        "content": "<image>Please describe this image",
        "role": "user"
      },
      {
        "content": "Chinese astronaut Gui Haichao is giving a speech.",
        "role": "assistant"
      },
      {
        "content": "What has he accomplished?",
        "role": "user"
      },
      {
        "content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/3.jpg"
    ]
  }
]
```

</details>

### Video Dataset

Refer to the video SFT demo data: [data/mllm_video_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_video_demo.json)

<details>
<summary>
<b>data/mllm_video_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<video>Why is this video funny?",
        "role": "user"
      },
      {
        "content": "Because a baby is reading, and he is so cute!",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ]
  },
  {
    "messages": [
      {
        "content": "<video>What is she doing?",
        "role": "user"
      },
      {
        "content": "She is cooking.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/2.avi"
    ]
  },
  {
    "messages": [
      {
        "content": "<video>What's in the video?",
        "role": "user"
      },
      {
        "content": "A baby is playing in the living room.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/3.mp4"
    ]
  }
]
```

</details>

### Audio Dataset

Refer to the audio SFT demo data: [data/mllm_audio_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_audio_demo.json)

<details>
<summary>
<b>data/mllm_audio_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<audio>What's that sound?",
        "role": "user"
      },
      {
        "content": "It is the sound of glass shattering.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/1.mp3"
    ]
  },
  {
    "messages": [
      {
        "content": "<audio>What can you hear?",
        "role": "user"
      },
      {
        "content": "A woman is coughing.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/2.wav"
    ]
  },
  {
    "messages": [
      {
        "content": "<audio>What does the person say?",
        "role": "user"
      },
      {
        "content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/3.flac"
    ]
  }
]
```

</details>

## LoRA Fine-Tuning

You can launch LoRA SFT with a single command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/minicpmo_2_6_lora_sft.yaml
```

<details>
<summary>
<b>configs/minicpmo_2_6_lora_sft.yaml</b>
</summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo mllm_audio_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpmo_2_6/lora/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true

### eval
do_eval: false
```

</details>

### LoRA Model Export

Export the LoRA-merged model with one command:

```shell
llamafactory-cli export configs/minicpmo_2_6_lora_export.yaml
```

<details>
<summary>
<b>configs/minicpmo_2_6_lora_export.yaml</b>
</summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
template: minicpm_o # minicpm_o minicpm_v
finetuning_type: lora
trust_remote_code: true

### export
export_dir: models/minicpmo_2_6_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false
```

</details>

## Full-Parameter Fine-Tuning

You can launch full-parameter SFT with a single command (the DeepSpeed config referenced below is sketched after this block):

```shell
llamafactory-cli train configs/minicpmo_2_6_full_sft.yaml
```

<details>
<summary>
<b>configs/minicpmo_2_6_full_sft.yaml</b>
</summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true
freeze_vision_tower: true
print_param_status: true
flash_attn: fa2

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: configs/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpmo_2_6/full/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true

### eval
do_eval: false
```
</details>
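
Note that `configs/deepspeed/ds_z2_config.json` is referenced by the YAML above but not created by the installation step. Below is a minimal ZeRO-2 sketch modeled on the `examples/deepspeed/ds_z2_config.json` shipped with LLaMA-Factory; prefer the repository's copy if available, as the exact contents here are an assumption:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
```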

## Inference

### Web UI ChatBox

Refer to the [LLaMA-Factory docs](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#inferring-lora-fine-tuned-models) for more inference usage.

For example, you can launch a web chat with one command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat configs/minicpmo_2_6_infer.yaml
```

<details>
<summary>
<b>configs/minicpmo_2_6_infer.yaml</b>
</summary>

```yaml
model_name_or_path: saves/minicpmo_2_6/full/sft
template: minicpm_o # minicpm_o minicpm_v
infer_backend: huggingface
trust_remote_code: true
```
</details>
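
The same config can also drive an interactive terminal session via the standard LLaMA-Factory CLI (assuming the `chat` subcommand of your installed version):

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat configs/minicpmo_2_6_infer.yaml
```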

### Official Code
You can also run inference with the official code:

<details>
<summary>
<b>official inference code</b>
</summary>

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "saves/minicpmo_2_6/full/sft"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('data/mllm_demo_data/1.jpg').convert('RGB')
question = 'Who are they?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
```

</details>
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_llama3_v2dot5.md ADDED
## MiniCPM-Llama3-V 2.5

> Archived at: 2025-01-13

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.**
MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.

- 💪 **Strong OCR Capabilities.**
MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex-reasoning abilities, enhancing multimodal interaction experiences.

- 🏆 **Trustworthy Behavior.**
Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best level within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).

- 🌏 **Multilingual Support.**
Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean, etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).

- 🚀 **Efficient Deployment.**
MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.

- 💫 **Easy Usage.**
MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage) (see the sketch after this list), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
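
As a quick illustration of the streaming output mentioned in point (4), `model.chat` accepts `stream=True` and then yields text chunks instead of a single string. This sketch follows the usage shown on the Hugging Face model card; treat the parameter defaults as assumptions:

```python
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,   # streaming requires sampling-based decoding
    stream=True,
)

generated_text = ""
for new_text in res:  # res is a generator of text chunks
    generated_text += new_text
    print(new_text, flush=True, end='')
```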

### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/MiniCPM-Llama3-V-2.5-peformance.png" width="66%" />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.</summary>

| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **Proprietary** | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| **Open-source** | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | **2050.2** | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-Llama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | **725** | **76.6** | **84.8** | **65.1** | 2024.6 | **77.2** | **74.2** | **45.8** | **54.3** | **86.7** | **63.5** | **89.7** |

\* We evaluate the officially released checkpoint by ourselves.

</details>

<div align="center">
<img src="../assets/llavabench_compare_3.png" width="100%" />
<br>
Evaluation results on multilingual LLaVA Bench
</div>

### Examples <!-- omit in toc -->

<table align="center">
<p align="center">
<img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>

### Model Zoo

| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) &nbsp;&nbsp;[<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v1.md ADDED
## MiniCPM-V 1.0

> Archived at: 2024-05-19

MiniCPM-V 1.0 is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Notable features of MiniCPM-V 1.0 include:

- ⚡️ **High Efficiency.**

  MiniCPM-V 1.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. In terms of visual encoding, we compress the image representations into 64 tokens via a perceiver resampler, which is significantly fewer than in other MLP-based LMMs (typically > 512 tokens). This allows MiniCPM-V 1.0 to operate with **much lower memory cost and higher speed during inference**.

- 🔥 **Promising Performance.**

  MiniCPM-V 1.0 achieves **state-of-the-art performance** on multiple benchmarks (including MMMU, MME, and MMBench) among models of comparable size, surpassing existing LMMs built on Phi-2. It even **achieves comparable or better performance than the 9.6B Qwen-VL-Chat**.

- 🙌 **Bilingual Support.**

  MiniCPM-V 1.0 is **the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese**. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight [paper](https://arxiv.org/abs/2308.12038).

### Evaluation

| Model | Size | Visual Tokens | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| LLaVA-Phi | 3B | 576 | 1335 | 59.8 | - | - | - |
| MobileVLM | 3B | 144 | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 576 | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 256 | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| CogVLM | 17.4B | 1225 | 1438 | 63.7 | 53.8 | 32.1 | - |
| **MiniCPM-V 1.0** | 3B | 64 | 1452 | 67.9 | 65.3 | 37.2 | 32.1 |

### Examples

We deploy MiniCPM-V 1.0 on end devices. The demo video is a raw, unedited screen recording on a OnePlus 9R.

<table align="center">
<p align="center">
<img src="assets/gif_cases/蛇_cn.gif" width=36%/>
<img src="assets/gif_cases/Mushroom_en.gif" width=36%/>
</p>
</table>

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create a conda environment

```Shell
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

## Inference

### Model Zoo
| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
| MiniCPM-V 1.0 | The efficient version for end-device deployment. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |

### Multi-turn Conversation
Please refer to the following code to run `MiniCPM-V 1.0`.

<div align="center">
<img src="assets/worldmap_ck.jpg" width="500px">
</div>

```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/MiniCPM-V')

im_64 = img2base64('./assets/worldmap_ck.jpg')

# First round chat: the message list is serialized to JSON and sent with the image.
msgs = [{"role": "user", "content": "What is interesting about this image?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat:
# pass the history context of the multi-turn conversation.
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is China in the image?"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
```

### Inference on Mac
<details>
<summary>Click to view an example. MiniCPM-V 1.0 can run on a Mac with MPS (Apple silicon or AMD GPUs).</summary>

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()

image = Image.open('./assets/worldmap_ck.jpg').convert('RGB')
question = 'What is interesting about this image?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)
```
Run with command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```
</details>

### Deployment on Mobile Phone

Currently, MiniCPM-V 1.0 can be deployed on mobile phones running Android and HarmonyOS. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2.md ADDED
@@ -0,0 +1,299 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## MiniCPM-V 2.0
2
+
3
+
4
+ > Archive at:2025-01-13
5
+
6
+
7
+
8
+ **MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0 has several notable features.
9
+
10
+ - 🔥 **State-of-the-art Performance.**
11
+
12
+ MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.
13
+
14
+ - 🏆 **Trustworthy Behavior.**
15
+
16
+ LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.
17
+
18
+ - 🌟 **High-Resolution Images at Any Aspect Raito.**
19
+
20
+ MiniCPM-V 2.0 can accept **1.8 million pixels (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).
21
+
22
+ - ⚡️ **High Efficiency.**
23
+
24
+ MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
25
+
26
+ - 🙌 **Bilingual Support.**
27
+
28
+ MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
29
+
30
+
31
+ ### Evaluation <!-- omit in toc -->
32
+
33
+ <div align="center">
34
+ <img src=../assets/minicpmv-2-peformance.png width=66% />
35
+ </div>
36
+ <details>
37
+ <summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>
38
+ <div align="center">
39
+
40
+ <table style="margin: 0px auto;">
41
+ <thead>
42
+ <tr>
43
+ <th align="left">Model</th>
44
+ <th>Size</th>
45
+ <th>TextVQA val</th>
46
+ <th>DocVQA test</th>
47
+ <th>OCRBench</th>
48
+ <th>OpenCompass</th>
49
+ <th nowrap="nowrap" >MME</th>
50
+ <th>MMB dev(en)</th>
51
+ <th>MMB dev(zh)</th>
52
+ <th>MMMU val</th>
53
+ <th>MathVista</th>
54
+ <th>LLaVA Bench</th>
55
+ <th nowrap="nowrap">Object HalBench</th>
56
+ </tr>
57
+ </thead>
58
+ <tbody align="center">
59
+ <tr>
60
+ <td colspan="12" align="left"><strong>Proprietary models</strong></td>
61
+ </tr>
62
+ <tr>
63
+ <td nowrap="nowrap" align="left">Gemini Pro Vision</td>
64
+ <td>- </td>
65
+ <td>74.6</td>
66
+ <td>88.1</td>
67
+ <td>680</td>
68
+ <td>63.8</td>
69
+ <td>2148.9</td>
70
+ <td>75.2</td>
71
+ <td>74.0</td>
72
+ <td>48.9</td>
73
+ <td>45.8</td>
74
+ <td>79.9</td>
75
+ <td>- </td>
76
+ </tr>
77
+ <tr>
78
+ <td nowrap="nowrap" align="left">GPT-4V</td>
79
+ <td>- </td>
80
+ <td>78.0</td>
81
+ <td>88.4</td>
82
+ <td>645</td>
83
+ <td>63.2</td>
84
+ <td>1771.5</td>
85
+ <td>75.1</td>
86
+ <td>75.0</td>
87
+ <td>53.8</td>
88
+ <td>47.8</td>
89
+ <td>93.1</td>
90
+ <td>86.4 / 92.7</td>
91
+ </tr>
92
+ <tr>
93
+ <td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
94
+ </tr>
95
+ <tr>
96
+ <td nowrap="nowrap" align="left" >Yi-VL-6B</td>
97
+ <td align="right" >6.7B</td>
98
+ <td>45.5*</td>
99
+ <td>17.1*</td>
100
+ <td>290</td>
101
+ <td>49.3</td>
102
+ <td>1915.1 </td>
103
+ <td>68.6 </td>
104
+ <td>68.3 </td>
105
+ <td>40.3 </td>
106
+ <td>28.8 </td>
107
+ <td>51.9 </td>
108
+ <td>- </td>
109
+ </tr>
110
+ <tr>
111
+ <td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
112
+ <td align="right" >9.6B</td>
113
+ <td>61.5</td>
114
+ <td>62.6</td>
115
+ <td>488 </td>
116
+ <td>52.1 </td>
117
+ <td>1860.0 </td>
118
+ <td>60.6 </td>
119
+ <td>56.7 </td>
120
+ <td>37.0 </td>
121
+ <td>33.8 </td>
122
+ <td>67.7 </td>
123
+ <td>56.2 / 80.0</td>
124
+ </tr>
125
+ <tr>
126
+ <td nowrap="nowrap" align="left" >Yi-VL-34B</td>
127
+ <td align="right" >34B</td>
128
+ <td>43.4*</td>
129
+ <td>16.9*</td>
130
+ <td>290</td>
131
+ <td>52.6 </td>
132
+ <td>2050.2</td>
133
+ <td>71.1</td>
134
+ <td>71.4</td>
135
+ <td>45.1</td>
136
+ <td>30.7</td>
137
+ <td>62.3</td>
138
+ <td>- </td>
139
+ </tr>
140
+ <tr>
141
+ <td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
142
+ <td align="right" >7.3B</td>
143
+ <td>64.7*</td>
144
+ <td>47.0* </td>
145
+ <td>435</td>
146
+ <td>55.6 </td>
147
+ <td>1765.4 </td>
148
+ <td>74.1 </td>
149
+ <td>72.8 </td>
150
+ <td>38.3 </td>
151
+ <td>36.8</td>
152
+ <td>77.8 </td>
153
+ <td>- </td>
154
+ </tr>
155
+ <tr>
156
+ <td nowrap="nowrap" align="left" >TextMonkey</td>
157
+ <td align="right" >9.7B</td>
158
+ <td>64.3</td>
159
+ <td>66.7 </td>
160
+ <td>558</td>
161
+ <td>- </td>
162
+ <td>- </td>
163
+ <td>- </td>
164
+ <td>- </td>
165
+ <td>- </td>
166
+ <td>-</td>
167
+ <td>- </td>
168
+ <td>- </td>
169
+ </tr>
170
+ <tr>
171
+ <td nowrap="nowrap" align="left" >CogVLM-Chat</td>
172
+ <td align="right" >17.4B</td>
173
+ <td>70.4</td>
174
+ <td>33.3*</td>
175
+ <td>590 </td>
176
+ <td>52.5 </td>
177
+ <td>1736.6 </td>
178
+ <td>63.7 </td>
179
+ <td>53.8 </td>
180
+ <td>37.3 </td>
181
+ <td>34.7 </td>
182
+ <td>73.9 </td>
183
+ <td>73.6 / 87.4 </td>
184
+ </tr>
185
+ <tr>
186
+ <td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
187
+ </tr>
188
+ <tr>
189
+ <td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
190
+ <td align="right" >1.7B</td>
191
+ <td>58.4*</td>
192
+ <td>37.9*</td>
193
+ <td>413</td>
194
+ <td>46.0 </td>
195
+ <td>1531.6 </td>
196
+ <td>64.0 </td>
197
+ <td>61.2 </td>
198
+ <td>33.8 </td>
199
+ <td>29.4 </td>
200
+ <td>51.1 </td>
201
+ <td>- </td>
202
+ </tr>
203
+ <tr>
204
+ <td nowrap="nowrap" align="left" >MobileVLM V2</td>
205
+ <td align="right" >3.1B</td>
206
+ <td>57.5</td>
207
+ <td>19.4*</td>
208
+ <td>-</td>
209
+ <td>-</td>
210
+ <td>1440.5(P) </td>
211
+ <td>63.2 </td>
212
+ <td>-</td>
213
+ <td>-</td>
214
+ <td>-</td>
215
+ <td>-</td>
216
+ <td>-</td>
217
+ </tr>
218
+ <tr>
219
+ <td nowrap="nowrap" align="left" >Mini-Gemini</td>
220
+ <td align="right" >2.2B</td>
221
+ <td>56.2</td>
222
+ <td>34.2*</td>
223
+ <td>-</td>
224
+ <td>-</td>
225
+ <td>1653.0 </td>
226
+ <td>59.8 </td>
227
+ <td>- </td>
228
+ <td>31.7 </td>
229
+ <td>-</td>
230
+ <td>- </td>
231
+ <td>- </td>
232
+ </tr>
233
+ <tr>
234
+ <td nowrap="nowrap" align="left" >MiniCPM-V</td>
235
+ <td align="right" >2.8B </td>
236
+ <td>60.6</td>
237
+ <td>38.2 </td>
238
+ <td>366</td>
239
+ <td>47.6</td>
240
+ <td>1650.2 </td>
241
+ <td>67.9 </td>
242
+ <td>65.3 </td>
243
+ <td><strong>38.3</strong></td>
244
+ <td>28.9</td>
245
+ <td>51.3 </td>
246
+ <td>78.4 / 88.5 </td>
247
+ </tr>
248
+ <tr>
249
+ <td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
250
+ <td align="right" >2.8B </td>
251
+ <td><strong>74.1</strong></td>
252
+ <td><strong>71.9</strong> </td>
253
+ <td><strong>605</strong></td>
254
+ <td><strong>55.0</strong></td>
255
+ <td><strong>1808.6</strong> </td>
256
+ <td><strong>69.6</strong> </td>
257
+ <td><strong>68.1</strong> </td>
258
+ <td>38.2 </td>
259
+ <td><strong>38.7</strong></td>
260
+ <td><strong>69.2</strong> </td>
261
+ <td><strong>85.5 / 92.2 </strong></td>
262
+ </tr>
263
+ </tbody>
264
+ </table>
265
+
266
+ </div>
267
+ * We evaluate the officially released checkpoint by ourselves.
268
+ </details>
269
+
270
+ ### Examples <!-- omit in toc -->
271
+
272
+ <table align="center">
273
+ <p align="center">
274
+ <img src="../assets/minicpmv2-cases_2.png" width=95%/>
275
+ </p>
276
+ </table>
277
+
278
+ We deploy MiniCPM-V 2.0 on end devices. The demo video is the raw screen recording on a Xiaomi 14 Pro without edition.
279
+
280
+ <table align="center">
281
+ <p align="center">
282
+ <img src="../assets/gif_cases/station.gif" width=36%/>
283
+ <img src="../assets/gif_cases/london_car.gif" width=36%/>
284
+ </p>
285
+ </table>
286
+
287
+
288
+
289
+ ### Model Zoo
290
+
291
+ | Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
292
+ |:-----------|:--:|:-----------:|:-------------------|:---------------:|
293
+ | MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
294
+ | MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
295
+
296
+
297
+ ### Deployment on Mobile Phone
298
+
299
+ MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK.
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2dot6.md ADDED
@@ -0,0 +1,945 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## MiniCPM-V 2.6
2
+
3
+ > Archived at: 2025-01-13
4
+
5
+ **MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
6
+
7
+ - 🔥 **Leading Performance.**
8
+ MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
9
+
10
+ - 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
11
+
12
+ - 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
13
+
14
+ - 💪 **Strong OCR Capability and Others.**
15
+ MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
16
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
17
+
18
+
19
+ - 🚀 **Superior Efficiency.**
20
+ In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad. (A worked example of the token-density arithmetic is given after this list.)
21
+
22
+ - 💫 **Easy Usage.**
23
+ MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/). A minimal vLLM example is sketched after this list.
24
+
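+ To make the token-density figure concrete, the arithmetic below reproduces the numbers quoted above: a 1344x1344 image is roughly 1.8 million pixels, and encoding it into 640 visual tokens gives the ~2822 pixels-per-token density reported in the evaluation table.
+
+ ```python
+ max_pixels = 1344 * 1344    # maximum supported resolution, ~1.8M pixels
+ visual_tokens = 640         # visual tokens produced for such an image
+
+ token_density = max_pixels / visual_tokens
+ print(f"{max_pixels:,} pixels / {visual_tokens} tokens = {token_density:.0f} pixels per token")
+ # -> 1,806,336 pixels / 640 tokens = 2822 pixels per token
+ ```
+
+ As a quick illustration of deployment option (3), the sketch below runs single-image inference through vLLM. This is a minimal example assuming a recent vLLM build with MiniCPM-V 2.6 support; the `(<image>./</image>)` placeholder and the stop tokens follow the Hugging Face model card, and `airplane.jpeg` stands in for any local image.
+
+ ```python
+ from PIL import Image
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ MODEL_NAME = "openbmb/MiniCPM-V-2_6"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
+ llm = LLM(model=MODEL_NAME, trust_remote_code=True, max_model_len=4096)
+
+ # Build a prompt containing the image placeholder expected by the chat template
+ messages = [{'role': 'user', 'content': '(<image>./</image>)\nWhat is in this image?'}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ stop_tokens = ['<|im_end|>', '<|endoftext|>']
+ stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in stop_tokens]
+
+ image = Image.open('airplane.jpeg').convert('RGB')
+ outputs = llm.generate(
+     {"prompt": prompt, "multi_modal_data": {"image": image}},
+     SamplingParams(temperature=0.7, max_tokens=128, stop_token_ids=stop_token_ids),
+ )
+ print(outputs[0].outputs[0].text)
+ ```
+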
25
+ ### Evaluation <!-- omit in toc -->
26
+ <div align="center">
27
+ <img src=../assets/radar_final.png width=66% />
28
+ </div>
29
+
30
+ <details>
31
+ <summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
32
+ <div align="center">
33
+
34
+ <table style="margin: 0px auto;">
35
+ <thead>
36
+ <tr>
37
+ <th align="left">Model</th>
38
+ <th>Size</th>
39
+ <th>Token Density<sup>+</sup></th>
40
+ <th>OpenCompass</th>
41
+ <th>MME</th>
42
+ <th>MMVet</th>
43
+ <th>OCRBench</th>
44
+ <th>MMMU val</th>
45
+ <th>MathVista mini</th>
46
+ <th>MMB1.1 test</th>
47
+ <th>AI2D</th>
48
+ <th>TextVQA val</th>
49
+ <th>DocVQA test</th>
50
+ <th>HallusionBench</th>
51
+ <th>Object HalBench</th>
52
+ </tr>
53
+ </thead>
54
+ <tbody align="center">
55
+ <tr>
56
+ <td colspan="15" align="left"><strong>Proprietary</strong></td>
57
+ </tr>
58
+ <tr>
59
+ <td nowrap="nowrap" align="left">GPT-4o</td>
60
+ <td>-</td>
61
+ <td>1088</td>
62
+ <td>69.9</td>
63
+ <td>2328.7</td>
64
+ <td>69.1</td>
65
+ <td>736</td>
66
+ <td>69.2</td>
67
+ <td>61.3</td>
68
+ <td>82.2</td>
69
+ <td>84.6</td>
70
+ <td>-</td>
71
+ <td>92.8</td>
72
+ <td>55.0</td>
73
+ <td>17.6</td>
74
+ </tr>
75
+ <tr>
76
+ <td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
77
+ <td>-</td>
78
+ <td>750</td>
79
+ <td>67.9</td>
80
+ <td>1920.0</td>
81
+ <td>66.0</td>
82
+ <td>788</td>
83
+ <td>65.9</td>
84
+ <td>61.6</td>
85
+ <td>78.5</td>
86
+ <td>80.2</td>
87
+ <td>-</td>
88
+ <td>95.2</td>
89
+ <td>49.9</td>
90
+ <td>13.8</td>
91
+ </tr>
92
+ <tr>
93
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
94
+ <td>-</td>
95
+ <td>-</td>
96
+ <td>64.4</td>
97
+ <td>2110.6</td>
98
+ <td>64.0</td>
99
+ <td>754</td>
100
+ <td>60.6</td>
101
+ <td>57.7</td>
102
+ <td>73.9</td>
103
+ <td>79.1</td>
104
+ <td>73.5</td>
105
+ <td>86.5</td>
106
+ <td>45.6</td>
107
+ <td>-</td>
108
+ </tr>
109
+ <tr>
110
+ <td nowrap="nowrap" align="left">GPT-4o mini</td>
111
+ <td>-</td>
112
+ <td>1088</td>
113
+ <td>64.1</td>
114
+ <td>2003.4</td>
115
+ <td>66.9</td>
116
+ <td>785</td>
117
+ <td>60.0</td>
118
+ <td>52.4</td>
119
+ <td>76.0</td>
120
+ <td>77.8</td>
121
+ <td>-</td>
122
+ <td>-</td>
123
+ <td>46.1</td>
124
+ <td>12.4</td>
125
+ </tr>
126
+ <tr>
127
+ <td nowrap="nowrap" align="left">GPT-4V</td>
128
+ <td>-</td>
129
+ <td>1088</td>
130
+ <td>63.5</td>
131
+ <td>2070.2</td>
132
+ <td>67.5</td>
133
+ <td>656</td>
134
+ <td>61.7</td>
135
+ <td>54.7</td>
136
+ <td>79.8</td>
137
+ <td>78.6</td>
138
+ <td>78.0</td>
139
+ <td>87.2</td>
140
+ <td>43.9</td>
141
+ <td>14.2</td>
142
+ </tr>
143
+ <tr>
144
+ <td nowrap="nowrap" align="left">Step-1V</td>
145
+ <td>-</td>
146
+ <td>-</td>
147
+ <td>59.5</td>
148
+ <td>2206.4</td>
149
+ <td>63.3</td>
150
+ <td>625</td>
151
+ <td>49.9</td>
152
+ <td>44.8</td>
153
+ <td>78.0</td>
154
+ <td>79.2</td>
155
+ <td>71.6</td>
156
+ <td>-</td>
157
+ <td>48.4</td>
158
+ <td>-</td>
159
+ </tr>
160
+ <tr>
161
+ <td nowrap="nowrap" align="left">Qwen-VL-Max</td>
162
+ <td>-</td>
163
+ <td>784</td>
164
+ <td>58.3</td>
165
+ <td>2281.7</td>
166
+ <td>61.8</td>
167
+ <td>684</td>
168
+ <td>52.0</td>
169
+ <td>43.4</td>
170
+ <td>74.6</td>
171
+ <td>75.7</td>
172
+ <td>79.5</td>
173
+ <td>93.1</td>
174
+ <td>41.2</td>
175
+ <td>13.4</td>
176
+ </tr>
177
+ <tr>
178
+ <td colspan="15" align="left"><strong>Open-source</strong></td>
179
+ </tr>
180
+ <tr>
181
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
182
+ <td>34B</td>
183
+ <td>157</td>
184
+ <td>55.0</td>
185
+ <td>2006.5</td>
186
+ <td>50.7</td>
187
+ <td>574</td>
188
+ <td>48.8</td>
189
+ <td>40.4</td>
190
+ <td>77.8</td>
191
+ <td>78.9</td>
192
+ <td>69.3</td>
193
+ <td>-</td>
194
+ <td>34.8</td>
195
+ <td>12.6</td>
196
+ </tr>
197
+ <tr>
198
+ <td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
199
+ <td>34B</td>
200
+ <td>157</td>
201
+ <td>-</td>
202
+ <td>2141.0</td>
203
+ <td>59.3</td>
204
+ <td>518</td>
205
+ <td>48.0</td>
206
+ <td>43.3</td>
207
+ <td>-</td>
208
+ <td>80.5</td>
209
+ <td>74.1</td>
210
+ <td>78.9</td>
211
+ <td>-</td>
212
+ <td>-</td>
213
+ </tr>
214
+ <tr>
215
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
216
+ <td>34B</td>
217
+ <td>1820</td>
218
+ <td>58.3</td>
219
+ <td>2049.9</td>
220
+ <td>53.2</td>
221
+ <td>591</td>
222
+ <td>50.4</td>
223
+ <td>50.3</td>
224
+ <td>77.8</td>
225
+ <td>79.5</td>
226
+ <td>76.7</td>
227
+ <td>75.5</td>
228
+ <td>41.6</td>
229
+ <td>14.7</td>
230
+ </tr>
231
+ <tr>
232
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
233
+ <td>13B</td>
234
+ <td>784</td>
235
+ <td>59.1</td>
236
+ <td>2018.8</td>
237
+ <td>58.0</td>
238
+ <td>776</td>
239
+ <td>46.9</td>
240
+ <td>51.1</td>
241
+ <td>67.9</td>
242
+ <td>71.2</td>
243
+ <td>-</td>
244
+ <td>-</td>
245
+ <td>45.0</td>
246
+ <td>-</td>
247
+ </tr>
248
+ <tr>
249
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
250
+ <td>8B</td>
251
+ <td>706</td>
252
+ <td>64.1</td>
253
+ <td>2215.1</td>
254
+ <td>54.3</td>
255
+ <td>794</td>
256
+ <td><strong>51.2</strong></td>
257
+ <td>58.3</td>
258
+ <td><strong>79.4</strong></td>
259
+ <td><strong>83.6</strong></td>
260
+ <td>77.4</td>
261
+ <td><strong>91.6</strong></td>
262
+ <td>45.0</td>
263
+ <td>21.3</td>
264
+ </tr>
265
+ <tr>
266
+ <td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
267
+ <td>8B</td>
268
+ <td>1882</td>
269
+ <td>58.8</td>
270
+ <td>2024.6</td>
271
+ <td>52.8</td>
272
+ <td>725</td>
273
+ <td>45.8</td>
274
+ <td>54.3</td>
275
+ <td>72.0</td>
276
+ <td>78.4</td>
277
+ <td>76.6</td>
278
+ <td>84.8</td>
279
+ <td>42.4</td>
280
+ <td>10.3</td>
281
+ </tr>
282
+ <tr style="background-color: #e6f2ff;">
283
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
284
+ <td>8B</td>
285
+ <td><strong>2822</strong></td>
286
+ <td><strong>65.2</strong></td>
287
+ <td><strong>2348.4</strong>*</td>
288
+ <td><strong>60.0</strong></td>
289
+ <td><strong>852</strong>*</td>
290
+ <td>49.8*</td>
291
+ <td><strong>60.6</strong></td>
292
+ <td>78.0</td>
293
+ <td>82.1</td>
294
+ <td><strong>80.1</strong></td>
295
+ <td>90.8</td>
296
+ <td><strong>48.1</strong>*</td>
297
+ <td><strong>8.2</strong></td>
298
+ </tr>
299
+ </tbody>
300
+ </table>
301
+
302
+ </div>
303
+ * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
304
+
305
+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
306
+
307
+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
308
+
309
+ </details>
310
+
311
+
312
+ <details>
313
+ <summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
314
+ <div align="center">
315
+
316
+ <table style="margin: 0px auto;">
317
+ <thead>
318
+ <tr>
319
+ <th align="left">Model</th>
320
+ <th>Size</th>
321
+ <th>Mantis Eval</th>
322
+ <th>BLINK val</th>
323
+ <th>Mathverse mv</th>
324
+ <th>Sciverse mv</th>
325
+ <th>MIRB</th>
326
+ </tr>
327
+ </thead>
328
+ <tbody align="center">
329
+ <tr>
330
+ <td colspan="7" align="left"><strong>Proprietary</strong></td>
331
+ </tr>
332
+ <tr>
333
+ <td nowrap="nowrap" align="left">GPT-4V</td>
334
+ <td>-</td>
335
+ <td>62.7</td>
336
+ <td>54.6</td>
337
+ <td>60.3</td>
338
+ <td>66.9</td>
339
+ <td>53.1</td>
340
+ </tr>
341
+ <tr>
342
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
343
+ <td>14B</td>
344
+ <td>66.4</td>
345
+ <td>52.6</td>
346
+ <td>32.7</td>
347
+ <td>30.2</td>
348
+ <td>-</td>
349
+ </tr>
350
+ <tr>
351
+ <td colspan="7" align="left"><strong>Open-source</strong></td>
352
+ </tr>
353
+ <tr>
354
+ <td nowrap="nowrap" align="left">Emu2-Chat</td>
355
+ <td>37B</td>
356
+ <td>37.8</td>
357
+ <td>36.2</td>
358
+ <td>-</td>
359
+ <td>27.2</td>
360
+ <td>-</td>
361
+ </tr>
362
+ <tr>
363
+ <td nowrap="nowrap" align="left">CogVLM</td>
364
+ <td>17B</td>
365
+ <td>45.2</td>
366
+ <td>41.1</td>
367
+ <td>-</td>
368
+ <td>-</td>
369
+ <td>-</td>
370
+ </tr>
371
+ <tr>
372
+ <td nowrap="nowrap" align="left">VPG-C</td>
373
+ <td>7B</td>
374
+ <td>52.4</td>
375
+ <td>43.1</td>
376
+ <td>24.3</td>
377
+ <td>23.1</td>
378
+ <td>-</td>
379
+ </tr>
380
+ <tr>
381
+ <td nowrap="nowrap" align="left">VILA 8B</td>
382
+ <td>8B</td>
383
+ <td>51.2</td>
384
+ <td>39.3</td>
385
+ <td>-</td>
386
+ <td>36.5</td>
387
+ <td>-</td>
388
+ </tr>
389
+ <tr>
390
+ <td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
391
+ <td>8B</td>
392
+ <td>53.1*</td>
393
+ <td>48.9</td>
394
+ <td>32.1*</td>
395
+ <td>-</td>
396
+ <td>42.5</td>
397
+ </tr>
398
+ <tr>
399
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
400
+ <td>8B</td>
401
+ <td>59.0*</td>
402
+ <td>50.9</td>
403
+ <td>30.5*</td>
404
+ <td>34.4*</td>
405
+ <td><strong>56.9*</strong></td>
406
+ </tr>
407
+ <tr style="background-color: #e6f2ff;">
408
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
409
+ <td>8B</td>
410
+ <td><strong>69.1</strong></td>
411
+ <td><strong>53.0</strong></td>
412
+ <td><strong>84.9</strong></td>
413
+ <td><strong>74.9</strong></td>
414
+ <td>53.8</td>
415
+ </tr>
416
+ </tbody>
417
+ </table>
418
+
419
+ </div>
420
+ * We evaluate the officially released checkpoint by ourselves.
421
+ </details>
422
+
423
+ <details>
424
+ <summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
425
+ <div align="center">
426
+ <table style="margin: 0px auto;">
427
+ <thead>
428
+ <tr>
429
+ <th align="left">Model</th>
430
+ <th>Size</th>
431
+ <th colspan="2">Video-MME</th>
432
+ <th colspan="5">Video-ChatGPT</th>
433
+ </tr>
434
+ <tr>
435
+ <th align="left"></th>
436
+ <th></th>
437
+ <th>w/o subs</th>
438
+ <th>w subs</th>
439
+ <th>Correctness</th>
440
+ <th>Detail</th>
441
+ <th>Context</th>
442
+ <th>Temporal</th>
443
+ <th>Consistency</th>
444
+ </tr>
445
+ </thead>
446
+ <tbody align="center">
447
+ <tr>
448
+ <td colspan="9" align="left"><strong>Proprietary</strong></td>
449
+ </tr>
450
+ <tr>
451
+ <td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
452
+ <td>-</td>
453
+ <td>60.0</td>
454
+ <td>62.9</td>
455
+ <td>-</td>
456
+ <td>-</td>
457
+ <td>-</td>
458
+ <td>-</td>
459
+ <td>-</td>
460
+ </tr>
461
+ <tr>
462
+ <td nowrap="nowrap" align="left">GPT-4V</td>
463
+ <td>-</td>
464
+ <td>59.9</td>
465
+ <td>63.3</td>
466
+ <td>-</td>
467
+ <td>-</td>
468
+ <td>-</td>
469
+ <td>-</td>
470
+ <td>-</td>
471
+ </tr>
472
+ <tr>
473
+ <td colspan="9" align="left"><strong>Open-source</strong></td>
474
+ </tr>
475
+ <tr>
476
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
477
+ <td>7B</td>
478
+ <td>-</td>
479
+ <td>-</td>
480
+ <td>3.39</td>
481
+ <td>3.29</td>
482
+ <td>3.92</td>
483
+ <td>2.60</td>
484
+ <td>3.12</td>
485
+ </tr>
486
+ <tr>
487
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
488
+ <td>34B</td>
489
+ <td>-</td>
490
+ <td>-</td>
491
+ <td>3.29</td>
492
+ <td>3.23</td>
493
+ <td>3.83</td>
494
+ <td>2.51</td>
495
+ <td>3.47</td>
496
+ </tr>
497
+ <tr>
498
+ <td nowrap="nowrap" align="left">CogVLM2-Video</td>
499
+ <td>12B</td>
500
+ <td>-</td>
501
+ <td>-</td>
502
+ <td>3.49</td>
503
+ <td><strong>3.46</strong></td>
504
+ <td>3.23</td>
505
+ <td><strong>2.98</strong></td>
506
+ <td><strong>3.64</strong></td>
507
+ </tr>
508
+ <tr>
509
+ <td nowrap="nowrap" align="left">LongVA</td>
510
+ <td>7B</td>
511
+ <td>52.4</td>
512
+ <td>54.3</td>
513
+ <td>3.05</td>
514
+ <td>3.09</td>
515
+ <td>3.77</td>
516
+ <td>2.44</td>
517
+ <td><strong>3.64</strong></td>
518
+ </tr>
519
+ <tr>
520
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
521
+ <td>8B</td>
522
+ <td>54.0</td>
523
+ <td>56.9</td>
524
+ <td>-</td>
525
+ <td>-</td>
526
+ <td>-</td>
527
+ <td>-</td>
528
+ <td>-</td>
529
+ </tr>
530
+ <tr>
531
+ <td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
532
+ <td>8B</td>
533
+ <td>55.8</td>
534
+ <td>-</td>
535
+ <td>-</td>
536
+ <td>-</td>
537
+ <td>-</td>
538
+ <td>-</td>
539
+ <td>-</td>
540
+ </tr>
541
+ <tr>
542
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
543
+ <td>32B</td>
544
+ <td>60.2</td>
545
+ <td>63.0</td>
546
+ <td>3.48</td>
547
+ <td>3.37</td>
548
+ <td><strong>3.95</strong></td>
549
+ <td>2.64</td>
550
+ <td>3.28</td>
551
+ </tr>
552
+ <tr style="background-color: #e6f2ff;">
553
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
554
+ <td>8B</td>
555
+ <td><strong>60.9</strong></td>
556
+ <td><strong>63.6</strong></td>
557
+ <td><strong>3.59</strong></td>
558
+ <td>3.28</td>
559
+ <td>3.93</td>
560
+ <td>2.73</td>
561
+ <td>3.62</td>
562
+ </tr>
563
+ </tbody>
564
+ </table>
565
+ </div>
566
+ </details>
567
+
568
+
569
+ <details>
570
+ <summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
571
+ <div align="center">
572
+ <table style="margin: 0px auto;">
573
+ <thead>
574
+ <tr>
575
+ <th align="left">Model</th>
576
+ <th>Size</th>
577
+ <th>Shot</th>
578
+ <th>TextVQA val</th>
579
+ <th>VizWiz test-dev</th>
580
+ <th>VQAv2 test-dev</th>
581
+ <th>OK-VQA val</th>
582
+ </tr>
583
+ </thead>
584
+ <tbody align="center">
585
+ <tr>
586
+ <td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
587
+ <td rowspan="3">80B</td>
588
+ <td>0*</td>
589
+ <td>35.0</td>
590
+ <td>31.6</td>
591
+ <td>56.3</td>
592
+ <td>40.6</td>
593
+ </tr>
594
+ <tr>
595
+ <td>4</td>
596
+ <td>36.5</td>
597
+ <td>39.6</td>
598
+ <td>63.1</td>
599
+ <td><strong>57.4</strong></td>
600
+ </tr>
601
+ <tr>
602
+ <td>8</td>
603
+ <td>37.3</td>
604
+ <td>44.8</td>
605
+ <td>65.6</td>
606
+ <td>57.5</td>
607
+ </tr>
608
+ <tr>
609
+ <td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
610
+ <td rowspan="3">80B</td>
611
+ <td>0*</td>
612
+ <td>30.9</td>
613
+ <td>36.0</td>
614
+ <td>60.0</td>
615
+ <td>45.2</td>
616
+ </tr>
617
+ <tr>
618
+ <td>4</td>
619
+ <td>34.3</td>
620
+ <td>40.4</td>
621
+ <td>63.6</td>
622
+ <td>52.4</td>
623
+ </tr>
624
+ <tr>
625
+ <td>8</td>
626
+ <td>35.7</td>
627
+ <td>46.1</td>
628
+ <td>64.8</td>
629
+ <td>55.1</td>
630
+ </tr>
631
+ <tr>
632
+ <td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
633
+ <td rowspan="3">7B</td>
634
+ <td>0*</td>
635
+ <td>43.0</td>
636
+ <td>49.8</td>
637
+ <td>63.2</td>
638
+ <td>45.5</td>
639
+ </tr>
640
+ <tr>
641
+ <td>4</td>
642
+ <td>45.4</td>
643
+ <td>51.3</td>
644
+ <td>64.5</td>
645
+ <td>46.5</td>
646
+ </tr>
647
+ <tr>
648
+ <td>8</td>
649
+ <td>45.6</td>
650
+ <td>52.2</td>
651
+ <td>64.7</td>
652
+ <td>46.6</td>
653
+ </tr>
654
+ <tr>
655
+ <td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
656
+ <td rowspan="3">37B</td>
657
+ <td>0</td>
658
+ <td>26.4</td>
659
+ <td>40.4</td>
660
+ <td>33.5</td>
661
+ <td>26.7</td>
662
+ </tr>
663
+ <tr>
664
+ <td>4</td>
665
+ <td>48.2</td>
666
+ <td>54.6</td>
667
+ <td>67.0</td>
668
+ <td>53.2</td>
669
+ </tr>
670
+ <tr>
671
+ <td>8</td>
672
+ <td>49.3</td>
673
+ <td>54.7</td>
674
+ <td>67.8</td>
675
+ <td>54.1</td>
676
+ </tr>
677
+ <tr>
678
+ <td align="left" nowrap="nowrap" rowspan="2">MM1</td>
679
+ <td rowspan="2">30B</td>
680
+ <td>0</td>
681
+ <td>26.2</td>
682
+ <td>40.4</td>
683
+ <td>48.9</td>
684
+ <td>26.7</td>
685
+ </tr>
686
+ <tr>
687
+ <td>8</td>
688
+ <td>49.3</td>
689
+ <td>54.7</td>
690
+ <td><strong>70.9</strong></td>
691
+ <td>54.1</td>
692
+ </tr>
693
+ <tr style="background-color: #e6f2ff;">
694
+ <td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
695
+ <td rowspan="3">8B</td>
696
+ <td>0</td>
697
+ <td>43.9</td>
698
+ <td>33.8</td>
699
+ <td>45.4</td>
700
+ <td>23.9</td>
701
+ </tr>
702
+ <tr style="background-color: #e6f2ff;">
703
+ <td>4</td>
704
+ <td>63.6</td>
705
+ <td>60.5</td>
706
+ <td>65.5</td>
707
+ <td>50.1</td>
708
+ </tr>
709
+ <tr style="background-color: #e6f2ff;">
710
+ <td>8</td>
711
+ <td><strong>64.6</strong></td>
712
+ <td><strong>63.4</strong></td>
713
+ <td>68.2</td>
714
+ <td>51.4</td>
715
+ </tr>
716
+ </tbody>
717
+ </table>
718
+
719
+
720
+ </div>
721
+ * denotes zero image shot and two additional text shots following Flamingo.
722
+
723
+ <sup>+</sup> We evaluate the pretrained checkpoint without SFT.
724
+ </details>
725
+
726
+ ### Examples <!-- omit in toc -->
727
+
728
+ <div style="display: flex; flex-direction: column; align-items: center;">
729
+ <img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
730
+ <img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
731
+ <img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
732
+ <img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
733
+ <img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
734
+ </div>
735
+ <details>
736
+ <summary>Click to view more cases.</summary>
737
+ <div style="display: flex; flex-direction: column; align-items: center;">
738
+ <img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
739
+ <img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
740
+ </div>
741
+ </details>
742
+
743
+ We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without any editing.
744
+
745
+ <table align="center">
746
+ <p align="center">
747
+ <img src="../assets/gif_cases/ai.gif" width=32%/>
748
+ &nbsp;&nbsp;&nbsp;&nbsp;
749
+ <img src="../assets/gif_cases/beer.gif" width=32%/>
750
+ </p>
751
+ </table>
752
+
753
+ <table align="center">
754
+ <p align="center">
755
+ <img src="../assets/gif_cases/ticket.gif" width=32%/>
756
+ &nbsp;&nbsp;&nbsp;&nbsp;
757
+ <img src="../assets/gif_cases/wfh.gif" width=32%/>
758
+ </p>
759
+ </table>
760
+
761
+ <table align="center">
762
+ <p align="center">
763
+ <video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
764
+ <!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
765
+ </p>
766
+ </table>
767
+
768
769
+
770
+
771
+
772
+ ### Multi-turn Conversation
773
+
774
+
775
+ <div align="center">
776
+ <img src="../assets/airplane.jpeg" width="500px">
777
+ </div>
778
+
779
+
780
+ ```python
781
+ import torch
782
+ from PIL import Image
783
+ from transformers import AutoModel, AutoTokenizer
784
+
785
+ torch.manual_seed(0)
786
+
787
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
788
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
789
+ model = model.eval().cuda()
790
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
791
+
792
+ image = Image.open('./assets/airplane.jpeg').convert('RGB')
793
+
794
+ # First round chat
795
+ question = "Tell me the model of this aircraft."
796
+ msgs = [{'role': 'user', 'content': [image, question]}]
797
+
798
+ answer = model.chat(
799
+ image=None,
800
+ msgs=msgs,
801
+ tokenizer=tokenizer
802
+ )
803
+ print(answer)
804
+
805
+ # Second round chat
806
+ # pass history context of multi-turn conversation
807
+ msgs.append({"role": "assistant", "content": [answer]})
808
+ msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
809
+
810
+ answer = model.chat(
811
+ image=None,
812
+ msgs=msgs,
813
+ tokenizer=tokenizer
814
+ )
815
+ print(answer)
816
+ ```
817
+
818
+ You should get output similar to the following:
819
+
820
+ ```
821
+ "The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
822
+
823
+ "The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
824
+ ```
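+
+ The remote code also supports streaming generation. The snippet below (following the Hugging Face model card) reuses the `model`, `tokenizer`, and `msgs` objects from the example above; `sampling=True` and `stream=True` make `chat()` return a generator of text chunks.
+
+ ```python
+ res = model.chat(
+     image=None,
+     msgs=msgs,
+     tokenizer=tokenizer,
+     sampling=True,   # streaming requires sampling-based decoding
+     stream=True
+ )
+
+ generated_text = ""
+ for new_text in res:
+     generated_text += new_text
+     print(new_text, flush=True, end='')
+ ```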
825
+
826
+ #### Multi-image Understanding
827
+ <details>
828
+ <summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
829
+
830
+ ```python
831
+ import torch
832
+ from PIL import Image
833
+ from transformers import AutoModel, AutoTokenizer
834
+
835
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
836
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
837
+ model = model.eval().cuda()
838
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
839
+
840
+ image1 = Image.open('image1.jpg').convert('RGB')
841
+ image2 = Image.open('image2.jpg').convert('RGB')
842
+ question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
843
+
844
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
845
+
846
+ answer = model.chat(
847
+ image=None,
848
+ msgs=msgs,
849
+ tokenizer=tokenizer
850
+ )
851
+ print(answer)
852
+ ```
853
+ </details>
854
+
855
+ #### Few-shot In-Context-Learning
856
+
857
+ <details>
858
+ <summary> Click to view Python example of MiniCPM-V 2.6 few-shot in-context-learning example </summary>
859
+
860
+ ```python
861
+ import torch
862
+ from PIL import Image
863
+ from transformers import AutoModel, AutoTokenizer
864
+
865
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
866
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
867
+ model = model.eval().cuda()
868
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
869
+
870
+ question = "production date"
871
+ image1 = Image.open('example1.jpg').convert('RGB')
872
+ answer1 = "2023.08.04"
873
+ image2 = Image.open('example2.jpg').convert('RGB')
874
+ answer2 = "2007.04.24"
875
+ image_test = Image.open('test.jpg').convert('RGB')
876
+
877
+ msgs = [
878
+ {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
879
+ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
880
+ {'role': 'user', 'content': [image_test, question]}
881
+ ]
882
+
883
+ answer = model.chat(
884
+ image=None,
885
+ msgs=msgs,
886
+ tokenizer=tokenizer
887
+ )
888
+ print(answer)
889
+ ```
890
+ </details>
891
+
892
+ #### Video understanding
893
+ <details>
894
+ <summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
895
+
896
+ ```python
897
+ import torch
898
+ from PIL import Image
899
+ from transformers import AutoModel, AutoTokenizer
900
+ from decord import VideoReader, cpu # pip install decord
901
+
902
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
903
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
904
+ model = model.eval().cuda()
905
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
906
+
907
+ MAX_NUM_FRAMES=64 # if CUDA OOM, set a smaller number
908
+
909
+ def encode_video(video_path):
910
+ def uniform_sample(l, n):
911
+ gap = len(l) / n
912
+ idxs = [int(i * gap + gap / 2) for i in range(n)]
913
+ return [l[i] for i in idxs]
914
+
915
+ vr = VideoReader(video_path, ctx=cpu(0))
916
+ sample_fps = round(vr.get_avg_fps() / 1) # sample one frame per second
917
+ frame_idx = [i for i in range(0, len(vr), sample_fps)]
918
+ if len(frame_idx) > MAX_NUM_FRAMES:
919
+ frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
920
+ frames = vr.get_batch(frame_idx).asnumpy()
921
+ frames = [Image.fromarray(v.astype('uint8')) for v in frames]
922
+ print('num frames:', len(frames))
923
+ return frames
924
+
925
+ video_path="video_test.mp4"
926
+ frames = encode_video(video_path)
927
+ question = "Describe the video"
928
+ msgs = [
929
+ {'role': 'user', 'content': frames + [question]},
930
+ ]
931
+
932
+ # Set decode params for video
933
+ params = {}
934
+ params["use_image_id"] = False
935
+ params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448可设为1
936
+
937
+ answer = model.chat(
938
+ image=None,
939
+ msgs=msgs,
940
+ tokenizer=tokenizer,
941
+ **params
942
+ )
943
+ print(answer)
944
+ ```
945
+ </details>
r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm.md ADDED
@@ -0,0 +1,183 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## OmniLMM-12B
2
+
3
+ > OmniLMM-12B was released in the early stages of this project. We recommend using our [recently released models](./README_zh.md) for more efficient inference and stronger performance.
4
+
5
+ > Archived at: 2024-05-19
6
+
7
+ **OmniLMM-12B** is the most capable version in the current series. The model is initialized from EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler, and trained on multimodal data in a curriculum fashion. The model has three notable features:
8
+
9
+ - 🔥 **Leading Performance.**
10
+
11
+ OmniLMM-12B achieves **leading performance** on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.) compared with other models of similar size, and possesses fairly rich multimodal world knowledge.
12
+
13
+ - 🏆 **Trustworthy Behavior.**
14
+
15
+ Hallucination is a widely noted problem for multimodal LLMs: models often generate text that contradicts the facts in the image (e.g., confidently describing objects that do not exist in the picture). OmniLMM-12B is **the first open-source multimodal LLM with strong overall capability that is aligned via multimodal RLHF** (using the [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series of techniques). It achieves the **best level among open-source models** on the [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) hallucination benchmark, and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
16
+
17
+ - 🕹 **Real-time Multimodal Interaction.**
18
+
19
+ We combine OmniLMM-12B with GPT-3.5 (a text-only model) to build a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and handles speech input and output with the help of external tools. While still preliminary, we find that it can **replicate some of the fun cases in the Gemini demo video without any video editing**.
20
+
21
+ ### Evaluation <!-- omit in toc -->
22
+
23
+ <div align="center">
24
+ <img src=assets/radar_omnilmm12b.png width=66% />
25
+ </div>
26
+ <details>
27
+ <summary>Click to view detailed results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista.</summary>
28
+
29
+ <table>
30
+ <thead>
31
+ <tr>
32
+ <th align="left">Model</th>
33
+ <th>Size</th>
34
+ <th>MME</th>
35
+ <th nowrap="nowrap">MMB dev (en)</th>
36
+ <th nowrap="nowrap" >MMMU val</th>
37
+ <th nowrap="nowrap" >MMHal-Bench</th>
38
+ <th nowrap="nowrap" >Object HalBench</th>
39
+ <th nowrap="nowrap" >SeedBench-I</th>
40
+ <th>MathVista</th>
41
+ <th nowrap="nowrap" >LLaVA Bench</th>
42
+ </tr>
43
+ </thead>
44
+ <tbody align="center">
45
+ <tr>
46
+ <td align="left">GPT-4V†</td>
47
+ <td>-</td>
48
+ <td>1771.5</td>
49
+ <td>75.1 </td>
50
+ <td>56.8</td>
51
+ <td>3.53 / 70.8</td>
52
+ <td>86.4 / 92.7</td>
53
+ <td>71.6 </td>
54
+ <td>47.8 </td>
55
+ <td>93.1 </td>
56
+ </tr>
57
+ <tr>
58
+ <td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
59
+ <td>-</td>
60
+ <td>2183.4</td>
61
+ <td>66.2 </td>
62
+ <td>45.2</td>
63
+ <td>- </td>
64
+ <td>- </td>
65
+ <td>65.7 </td>
66
+ <td>36.0 </td>
67
+ <td>73.7 </td>
68
+ </tr>
69
+ <tr>
70
+ <td align="left">Yi-VL 6B</td>
71
+ <td align="right">6.7B </td>
72
+ <td>1915.1 </td>
73
+ <td>68.6 </td>
74
+ <td>40.3 </td>
75
+ <td>- </td>
76
+ <td>- </td>
77
+ <td>67.5 </td>
78
+ <td>28.8 </td>
79
+ <td>51.9 </td>
80
+ </tr>
81
+ <tr>
82
+ <td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
83
+ <td align="right">9.6B</td>
84
+ <td>1860.0</td>
85
+ <td>60.6 </td>
86
+ <td>35.9</td>
87
+ <td>2.93 / 59.4</td>
88
+ <td>56.2 / 80.0</td>
89
+ <td>64.8 </td>
90
+ <td>33.8 </td>
91
+ <td>67.7 </td>
92
+ </tr>
93
+ <tr>
94
+ <td align="left" >CogVLM-Chat</td>
95
+ <td align="right">17.4B</td>
96
+ <td>1736.6</td>
97
+ <td>63.7 </td>
98
+ <td>32.1 </td>
99
+ <td>2.68 / 52.1 </td>
100
+ <td>73.6 / 87.4 </td>
101
+ <td>68.8 </td>
102
+ <td>34.7 </td>
103
+ <td>73.9 </td>
104
+ </tr>
105
+ <tr>
106
+ <td align="left" >LLaVA 1.5</td>
107
+ <td align="right">13.6B </td>
108
+ <td>1808.4 </td>
109
+ <td>68.2 </td>
110
+ <td>36.4 </td>
111
+ <td>2.71 / 51.0 </td>
112
+ <td>53.7 / 77.4 </td>
113
+ <td>68.1 </td>
114
+ <td>26.4 </td>
115
+ <td>64.6 </td>
116
+ </tr>
117
+ <tr>
118
+ <td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
119
+ <td align="right">11.6B </td>
120
+ <td>1935.8 </td>
121
+ <td>71.6 </td>
122
+ <td>40.7 </td>
123
+ <td>3.45 / 68.8 </td>
124
+ <td>90.3 / 95.5 </td>
125
+ <td>71.1 </td>
126
+ <td>34.9 </td>
127
+ <td>72.0 </td>
128
+ </tr>
129
+ </tbody>
130
+ </table>
131
+ <small>†: Proprietary models</small>
132
+ <br>
133
+ </details>
134
+
135
+ ### Examples <!-- omit in toc -->
136
+
137
+ <table align="center" >
138
+ <p align="center" >
139
+ <img src="assets/omnilmm-12b-examples_2.png" />
140
+ </p>
141
+ </table>
142
+
143
+
144
+ We combine OmniLMM-12B and ChatGPT-3.5 (a text-only model) to build a **real-time multimodal interactive assistant**. OmniLMM-12B converts video frames into image descriptions, which are then fed to ChatGPT-3.5 to generate responses to user instructions. The demo video is a raw recording without any editing.
145
+
146
+ <div align="center" >
147
+ <video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/8fec13bf-bb47-4bf8-8f8c-d0b716a964ec" type="video/mp4" width=80%/>
148
+ </div>
149
+
150
+ ## Online Demo
151
+
152
+ You are welcome to try our web-based inference service at the following links: [OmniLMM-12B](http://120.92.209.146:8081) | [MiniCPM-V 2.0](http://120.92.209.146:80).
153
+
154
+ ## Installation
155
+
156
+ 1. Clone this repository and navigate to its directory
157
+
158
+ ```bash
159
+ git clone https://github.com/OpenBMB/MiniCPM-V.git
160
+ cd MiniCPM-V
161
+ ```
162
+
163
+ 2. Create a conda environment
164
+
165
+ ```shell
166
+ conda create -n MiniCPMV python=3.10 -y
167
+ conda activate MiniCPMV
168
+ ```
169
+
170
+ 3. Install dependencies
171
+
172
+ ```shell
173
+ pip install -r requirements.txt
174
+ ```
175
+
176
+ ## Inference
177
+
178
+ ### Model Zoo
179
+
180
+ | Model | Description | Download Link |
181
+ |:----------------------|:-------------------|:---------------:|
182
+ | OmniLMM-12B | The most capable version. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
183
+
r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm_en.md ADDED
@@ -0,0 +1,155 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## OmniLMM-12B
2
+
3
+ > OmniLMM-12B was released in the early stages of this project. We recommend using our [recently released models](./README.md) for better performance and efficiency.
4
+
5
+ > Archived at: 2024-05-19
6
+
7
+
8
+ **OmniLMM-12B** is the most capable version in the series. The model is built on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
9
+
10
+ - 🔥 **Strong Performance.**
11
+
12
+ OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model is also endowed with rich multimodal world knowledge.
13
+
14
+ - 🏆 **Trustworthy Behavior.**
15
+
16
+ LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing non-existent objects in images). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
17
+
18
+ - 🕹 **Real-time Multimodal Interaction.**
19
+
20
+ We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone and emits speech output. While still preliminary, we find the assistant can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
21
+
22
+
23
+ ### Evaluation <!-- omit in toc -->
24
+ <div align="center">
25
+ <img src=assets/radar_omnilmm12b.png width=66% />
26
+ </div>
27
+ <details>
28
+ <summary>Click to view results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista. </summary>
29
+
30
+ <table>
31
+ <thead>
32
+ <tr>
33
+ <th align="left">Model</th>
34
+ <th>Size</th>
35
+ <th>MME</th>
36
+ <th nowrap="nowrap">MMB dev (en)</th>
37
+ <th nowrap="nowrap" >MMMU val</th>
38
+ <th nowrap="nowrap" >MMHal-Bench</th>
39
+ <th nowrap="nowrap" >Object HalBench</th>
40
+ <th nowrap="nowrap" >SeedBench-I</th>
41
+ <th>MathVista</th>
42
+ <th nowrap="nowrap" >LLaVA Bench</th>
43
+ </tr>
44
+ </thead>
45
+ <tbody align="center">
46
+ <tr>
47
+ <td align="left">GPT-4V†</td>
48
+ <td>-</td>
49
+ <td>1771.5</td>
50
+ <td>75.1 </td>
51
+ <td>56.8</td>
52
+ <td>3.53 / 70.8</td>
53
+ <td>86.4 / 92.7</td>
54
+ <td>71.6 </td>
55
+ <td>47.8 </td>
56
+ <td>93.1 </td>
57
+ </tr>
58
+ <tr>
59
+ <td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
60
+ <td>-</td>
61
+ <td>2183.4</td>
62
+ <td>66.2 </td>
63
+ <td>45.2</td>
64
+ <td>- </td>
65
+ <td>- </td>
66
+ <td>65.7 </td>
67
+ <td>36.0 </td>
68
+ <td>73.7 </td>
69
+ </tr>
70
+ <tr>
71
+ <td align="left">Yi-VL 6B</td>
72
+ <td align="right">6.7B </td>
73
+ <td>1915.1 </td>
74
+ <td>68.6 </td>
75
+ <td>40.3 </td>
76
+ <td>- </td>
77
+ <td>- </td>
78
+ <td>67.5 </td>
79
+ <td>28.8 </td>
80
+ <td>51.9 </td>
81
+ </tr>
82
+ <tr>
83
+ <td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
84
+ <td align="right">9.6B</td>
85
+ <td>1860.0</td>
86
+ <td>60.6 </td>
87
+ <td>35.9</td>
88
+ <td>2.93 / 59.4</td>
89
+ <td>56.2 / 80.0</td>
90
+ <td>64.8 </td>
91
+ <td>33.8 </td>
92
+ <td>67.7 </td>
93
+ </tr>
94
+ <tr>
95
+ <td align="left" >CogVLM-Chat</td>
96
+ <td align="right">17.4B</td>
97
+ <td>1736.6</td>
98
+ <td>63.7 </td>
99
+ <td>32.1 </td>
100
+ <td>2.68 / 52.1 </td>
101
+ <td>73.6 / 87.4 </td>
102
+ <td>68.8 </td>
103
+ <td>34.7 </td>
104
+ <td>73.9 </td>
105
+ </tr>
106
+ <tr>
107
+ <td align="left" >LLaVA 1.5</td>
108
+ <td align="right">13.6B </td>
109
+ <td>1808.4 </td>
110
+ <td>68.2 </td>
111
+ <td>36.4 </td>
112
+ <td>2.71 / 51.0 </td>
113
+ <td>53.7 / 77.4 </td>
114
+ <td>68.1 </td>
115
+ <td>26.4 </td>
116
+ <td>64.6 </td>
117
+ </tr>
118
+ <tr>
119
+ <td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
120
+ <td align="right">11.6B </td>
121
+ <td>1935.8 </td>
122
+ <td>71.6 </td>
123
+ <td>40.7 </td>
124
+ <td>3.45 / 68.8 </td>
125
+ <td>90.3 / 95.5 </td>
126
+ <td>71.1 </td>
127
+ <td>34.9 </td>
128
+ <td>72.0 </td>
129
+ </tr>
130
+ </tbody>
131
+ </table>
132
+ <small>†: Proprietary models</small>
133
+ <br>
134
+ </details>
135
+
136
+ ### Examples <!-- omit in toc -->
137
+
138
+ <table align="center" >
139
+ <p align="center" >
140
+ <img src="assets/omnilmm-12b-examples_2.png" />
141
+ </p>
142
+ </table>
143
+
144
+
145
+ We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. Video frames are described in text using OmniLMM-12B, and ChatGPT-3.5 (text-only) is employed to generate responses according to the descriptions and user prompts. The demo video is a raw recording without any editing.
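+
+ The pipeline itself is simple to sketch. The outline below is illustrative only: `describe_frame` is a hypothetical wrapper around OmniLMM-12B's chat interface, the GPT-3.5 call uses the OpenAI chat-completions API, and the exact prompts and speech I/O tooling from the demo are not reproduced here.
+
+ ```python
+ import cv2                    # pip install opencv-python; camera capture
+ from openai import OpenAI     # text-only GPT-3.5 client
+
+ client = OpenAI()
+
+ def describe_frame(frame) -> str:
+     """Hypothetical wrapper: caption one frame with OmniLMM-12B (vision -> text)."""
+     raise NotImplementedError
+
+ cap = cv2.VideoCapture(0)                # camera stream
+ ok, frame = cap.read()
+ caption = describe_frame(frame)          # OmniLMM-12B turns the frame into text
+
+ messages = [
+     {"role": "system", "content": "Answer questions about a live camera feed, "
+                                   "given text descriptions of its frames."},
+     {"role": "user", "content": f"Current frame: {caption}\nWhat is happening right now?"},
+ ]
+ reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
+ print(reply.choices[0].message.content)  # spoken aloud via TTS in the actual demo
+ ```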
146
+
147
+ <div align="center" >
148
+ <video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/485a8f52-fb4d-4eca-8fee-506347efcfc6" type="video/mp4" width=80%/>
149
+ </div>
150
+
151
+ ### Model Zoo
152
+
153
+ | Model | Description | Download Link |
154
+ |:----------------------|:-------------------|:---------------:|
155
+ | OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
r1-a/response_generation/minicpm/MiniCPM-o/docs/swift_train_and_infer.md ADDED
@@ -0,0 +1,135 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## SWIFT Install
2
+ You can quickly install SWIFT with the following bash commands.
3
+
4
+ ``` bash
5
+ git clone https://github.com/modelscope/swift.git
6
+ cd swift
7
+ pip install -r requirements.txt
8
+ pip install -e '.[llm]'
9
+ ```
10
+
11
+ ## SWIFT Infer
12
+ Inference using SWIFT can be carried out in two ways: through a command line interface and via Python code.
13
+
14
+ ### Quick start
15
+ Here are the steps to launch SWIFT from the Bash command line:
16
+
17
+ 1. Running the following command will download MiniCPM-Llama3-V-2_5 and start inference:
18
+ ``` shell
19
+ CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
20
+ ```
21
+
22
+ 2. You can also control inference with the additional arguments listed below:
23
+ ```
24
+ model_id_or_path # Can be the model ID from Hugging Face or the local path to the model
25
+ infer_backend ['AUTO', 'vllm', 'pt'] # Backend for inference, default is auto
26
+ dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # Computational precision
27
+ max_length # Maximum length
28
+ max_new_tokens: int = 2048 # Maximum number of tokens to generate
29
+ do_sample: bool = True # Whether to sample during generation
30
+ temperature: float = 0.3 # Temperature coefficient during generation
31
+ top_k: int = 20
32
+ top_p: float = 0.7
33
+ repetition_penalty: float = 1. # Penalty for repetition
34
+ num_beams: int = 1 # Number of beams for beam search
35
+ stop_words: List[str] = None # List of stop words
36
+ quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] # Quantization method for the model
37
+ quantization_bit [0, 1, 2, 3, 4, 8] # Default is 0, which means no quantization is used
38
+ ```
39
+ 3. Example:
40
+ ``` shell
41
+ CUDA_VISIBLE_DEVICES=0,1 swift infer \
42
+ --model_type minicpm-v-v2_5-chat \
43
+ --model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
44
+ --dtype bf16
45
+ ```
46
+ ### Python code with SWIFT infer
47
+ The following demonstrates using Python code to initiate inference with the MiniCPM-Llama3-V-2_5 model through SWIFT.
48
+
49
+ ```python
50
+ import os
51
+ os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' # Select which GPUs are visible
52
+
53
+ from swift.llm import (
54
+ get_model_tokenizer, get_template, inference, ModelType,
55
+ get_default_template_type, inference_stream
56
+ ) # Import necessary modules
57
+
58
+ from swift.utils import seed_everything # Set random seed
59
+ import torch
60
+
61
+ model_type = ModelType.minicpm_v_v2_5_chat
62
+ template_type = get_default_template_type(model_type) # Obtain the template type, primarily used for constructing special tokens and image processing workflow
63
+ print(f'template_type: {template_type}')
64
+
65
+ model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
66
+ model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
67
+ model_kwargs={'device_map': 'auto'}) # Load the model and tokenizer: model type, path, device mapping, and computation precision
68
+ model.generation_config.max_new_tokens = 256
69
+ template = get_template(template_type, tokenizer) # Construct the template based on the template type
70
+ seed_everything(42)
71
+
72
+ images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png'] # Image URL
73
+ query = '距离各城市多远?' # "How far is it to each city?" (the query is in Chinese)
74
+ response, history = inference(model, template, query, images=images) # Obtain results through inference
75
+ print(f'query: {query}')
76
+ print(f'response: {response}')
77
+
78
+ # Streaming output
79
+ query = '距离最远的城市是哪?' # "Which city is the farthest away?" (the query is in Chinese)
80
+ gen = inference_stream(model, template, query, history, images=images) # Call the streaming output interface
81
+ print_idx = 0
82
+ print(f'query: {query}\nresponse: ', end='')
83
+ for response, history in gen:
84
+ delta = response[print_idx:]
85
+ print(delta, end='', flush=True)
86
+ print_idx = len(response)
87
+ print()
88
+ print(f'history: {history}')
89
+ ```
90
+
91
+ ## SWIFT Train
92
+ SWIFT supports training on local datasets. The training steps are as follows:
93
+ 1. Prepare the training data in the following JSONL format (a small script for generating such a file is sketched after these steps):
94
+ ```jsonl
95
+ {"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
96
+ {"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
97
+ {"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
98
+ ```
99
+ 2. LoRA Tuning:
100
+
101
+ The LoRA target modules are the k and v weights in the LLM. Pay attention to `eval_steps`: during evaluation SWIFT may hit an out-of-memory error, so you should set `eval_steps` to a very large value (e.g., 200000) to effectively skip evaluation.
102
+ ```shell
103
+ # Experimental environment: A100
104
+ # 32GB GPU memory
105
+ CUDA_VISIBLE_DEVICES=0 swift sft \
106
+ --model_type minicpm-v-v2_5-chat \
107
+ --dataset coco-en-2-mini
108
+ ```
109
+ 3. Full-parameter fine-tuning:
110
+
111
+ When the `lora_target_modules` argument is set to `ALL`, all of the model's parameters are fine-tuned.
112
+ ```shell
113
+ CUDA_VISIBLE_DEVICES=0,1 swift sft \
114
+ --model_type minicpm-v-v2_5-chat \
115
+ --dataset coco-en-2-mini \
116
+ --lora_target_modules ALL \
117
+ --eval_steps 200000
118
+ ```
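+
+ For reference, the snippet below generates a training file in the JSONL format shown in step 1. It is a small illustrative script; the `samples` list and the file name are placeholders for your own data.
+
+ ```python
+ import json
+
+ # (image_path, query, response) triples -- placeholders for your own data
+ samples = [
+     ("images/panda1.jpg", "What does this picture describe?", "This picture has a giant panda."),
+     ("images/panda2.jpg", "Is bamboo tasty?", "It seems pretty tasty judging by the panda's expression."),
+ ]
+
+ with open("train.jsonl", "w", encoding="utf-8") as f:
+     for image_path, query, response in samples:
+         record = {"query": query, "response": response, "history": [], "images": [image_path]}
+         f.write(json.dumps(record, ensure_ascii=False) + "\n")
+ ```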
119
+
120
+ ## LoRA Merge and Infer
121
+ The LoRA weight can be merge to the base model and then load to infer.
122
+
123
+ 1. To load the LoRA weights for inference, run the following command:
124
+ ```shell
125
+ CUDA_VISIBLE_DEVICES=0 swift infer \
126
+ --ckpt_dir /your/lora/save/checkpoint
127
+ ```
128
+ 2. Merge the LoRA weights into the base model:
129
+
130
+ The following command loads and merges the LoRA weights into the base model, saves the merged model to the LoRA checkpoint path, and loads the merged model for inference:
131
+ ```shell
132
+ CUDA_VISIBLE_DEVICES=0 swift infer \
133
+ --ckpt_dir your/lora/save/checkpoint \
134
+ --merge_lora true
135
+ ```