Add files using upload-large-folder tool
- r1-a/response_generation/minicpm/MiniCPM-o/assets/minicpmv-llama3-v2.5/temp +1 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/wechat.png +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/worldmap_ck.jpg +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_register_model2.png +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_search_box.png +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_webui_button.png +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/assets/zhihu.webp +0 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary.md +23 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary_zh.md +22 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/compare_with_phi-3_vision.md +27 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/faqs.md +30 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/inference_on_multiple_gpus.md +159 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/llamafactory_train_and_infer.md +445 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_llama3_v2dot5.md +333 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v1.md +214 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2.md +299 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2dot6.md +945 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm.md +183 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm_en.md +155 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/swift_train_and_infer.md +135 -0
r1-a/response_generation/minicpm/MiniCPM-o/assets/minicpmv-llama3-v2.5/temp
ADDED
@@ -0,0 +1 @@
+
r1-a/response_generation/minicpm/MiniCPM-o/assets/wechat.png
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/worldmap_ck.jpg
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_register_model2.png
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_search_box.png
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/xinferenc_demo_image/xinference_webui_button.png
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/assets/zhihu.webp
ADDED
r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary.md
ADDED
@@ -0,0 +1,23 @@
# MiniCPM-V Best Practices

**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video and text as inputs and provide high-quality text output, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6. The following sections provide detailed tutorials and guidance for each version of the MiniCPM-V models.

## MiniCPM-V 2.6

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Due to its superior token density, MiniCPM-V 2.6 can for the first time support real-time video understanding on end-side devices such as iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/GeHMwLMa0i2FhUkV0f6cz3HWnV1)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/YvsPwnPwWiqUjlkmW0scQ76TnBb)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/Kc7ywV4X1ipSaAkuPFOc9SFun8b)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/UpSiw63o9iGDhIklmwScX4a6nhW)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/Lwr9wpOQdinr6AkLzHrc9LlgnJD)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LTOKw3Hz7il9kGkCLX9czsennKe)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/Ug8iwdXfhiHVsDk2gGEco6xnnVg)
* [Model Structure](https://modelbest.feishu.cn/wiki/ACtAw9bOgiBQ9lkWyafcvtVEnQf)
r1-a/response_generation/minicpm/MiniCPM-o/docs/best_practice_summary_zh.md
ADDED
@@ -0,0 +1,22 @@
# MiniCPM-V Best Practices

**MiniCPM-V** is a series of end-side multimodal LLMs for vision-language understanding. The models take images and text as inputs and provide high-quality text outputs. Since February 2024 we have released five model versions, aiming to achieve **leading performance and efficient deployment**. The most notable models in the series currently include:

## MiniCPM-V 2.6

The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, it **surpasses GPT-4V** in single-image, multi-image and video understanding. In single-image understanding it outperforms commercial closed-source models such as **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet**, and further improves MiniCPM-Llama3-V 2.5's OCR, trustworthy behavior, multilingual support, end-side deployment, and many other features. Thanks to its leading visual token density, MiniCPM-V 2.6 is the first MLLM to support real-time video understanding on end-side devices such as iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LZxLwp4Lzi29vXklYLFchwN5nCf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/HvfLwYzlIihqzXkmeCdczs6onmd)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/PAsHw6N6xiEy0DkJWpJcIocRnz9)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. Its performance is a significant improvement over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/O0KTwQV5piUPzTkRXl9cSFyHnQb)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/MPkPwvONEiZm3BkWMnyc83Tin4d)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/CZZJw1EDGitSSZka664cZwbWnrb)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/BcHIwjOLGihJXCkkSdMc2WhbnZf)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/L0ajwm8VAiiPY6kDZfJce3B7nRg)
* [Model Structure](https://modelbest.feishu.cn/wiki/X15nwGzqpioxlikbi2RcXDpJnjd)
r1-a/response_generation/minicpm/MiniCPM-o/docs/compare_with_phi-3_vision.md
ADDED
@@ -0,0 +1,27 @@
## Phi-3-vision-128K-Instruct vs MiniCPM-Llama3-V 2.5

We compare Phi-3-vision-128K-Instruct and MiniCPM-Llama3-V 2.5 in terms of model size, hardware requirements, and performance.
With int4 quantization, MiniCPM-Llama3-V 2.5 delivers **smooth inference with only 8GB of GPU memory**. On most benchmarks, MiniCPM-Llama3-V 2.5 achieves **better performance** than Phi-3-vision-128K-Instruct. Moreover, MiniCPM-Llama3-V 2.5 also exhibits **lower latency and better throughput even without quantization**.

<div align="center">
<img src="../assets/phi3_vision_comparison.jpg" width="85%" />
</div>

### Multilingual Capabilities

MiniCPM-Llama3-V 2.5 exhibits **stronger multilingual capabilities** than Phi-3-vision-128K-Instruct on the conversation and reasoning benchmark LLaVA Bench.

<div align="center">
<img src="../assets/llavabench_compare_phi3.png" width="100%" />
<br>
Evaluation results on multilingual LLaVA Bench
</div>
r1-a/response_generation/minicpm/MiniCPM-o/docs/faqs.md
ADDED
@@ -0,0 +1,30 @@
### FAQs

<details>
<summary>Q: How to choose between sampling and beam search for inference</summary>

The quality of results from beam search and sampling decoding strategies varies by scenario. You can choose your decoding strategy based on the following aspects.

Consider using sampling decoding if:

1. You require faster inference speed.
2. You want streaming generation.
3. Your task calls for open-ended responses.

If your task is about providing deterministic answers, you might want to experiment with beam search to see if it achieves better outcomes.
</details>


<details>
<summary>Q: How to ensure that the model generates results of sufficient length</summary>

We've observed that during multilingual inference with MiniCPM-V 2.6, generation sometimes ends prematurely. You can improve the results by passing a `min_new_tokens` parameter:

```python
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    min_new_tokens=100
)
```
</details>
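The sampling-vs-beam-search guidance above can be condensed into a small helper that builds the keyword arguments to pass to generation. This is our own illustration, not part of the MiniCPM API; the `sampling` and `num_beams` parameter names are assumptions following the `model.chat` and Hugging Face `generate` conventions.

```python
# Illustrative helper (not part of MiniCPM): choose decoding kwargs based on
# the guidance above. Parameter names are assumed, following common conventions.
def decoding_kwargs(open_ended: bool, num_beams: int = 3) -> dict:
    if open_ended:
        # Sampling: faster, supports streaming, suits open-ended tasks.
        return {"sampling": True}
    # Beam search: worth trying for deterministic, short-answer tasks.
    return {"sampling": False, "num_beams": num_beams}

print(decoding_kwargs(open_ended=True))   # {'sampling': True}
```

The returned dict can then be spread into the chat call, e.g. `model.chat(image=..., msgs=msgs, tokenizer=tokenizer, **decoding_kwargs(open_ended=True))`.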
r1-a/response_generation/minicpm/MiniCPM-o/docs/inference_on_multiple_gpus.md
ADDED
@@ -0,0 +1,159 @@
## Using MiniCPM-Llama3-V-2_5 with Multiple GPUs

Due to the limited memory capacity of a single GPU, it may be impossible to load the entire MiniCPM-V model (the weights alone take 18 GiB) onto one device for inference (assuming a GPU with only 12 GiB or 16 GiB of memory). To address this limitation, multi-GPU inference can be employed, where the model's layers are distributed across multiple GPUs.

A minimal-modification approach can achieve this distribution, assigning the layers to different GPUs with minimal changes to the original model structure.

To implement this, we use features provided by the `accelerate` library.

Install all requirements of MiniCPM-Llama3-V-2_5; additionally, you also need to install `accelerate`:

```bash
pip install accelerate
```

<br/>

### Example Usage for `2x16GiB` GPUs

Consider a demo with two GPUs, each with 16 GiB of GPU memory.

1. Import the necessary libraries.

```python
from PIL import Image
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model, dispatch_model
```

2. Download the model weights.

```python
MODEL_PATH = '/local/path/to/MiniCPM-Llama3-V-2_5'  # you can download in advance or use `openbmb/MiniCPM-Llama3-V-2_5`
```

3. Determine the distribution of layers across the GPUs.

```python
max_memory_each_gpu = '10GiB'  # Maximum memory to use on each GPU. We suggest a conservative value, because the weights are not everything: intermediate activations also use GPU memory (10GiB < 16GiB).

gpu_device_ids = [0, 1]  # Which GPUs to use (here, two GPUs with 16GiB each)

no_split_module_classes = ["LlamaDecoderLayer"]

max_memory = {
    device_id: max_memory_each_gpu for device_id in gpu_device_ids
}

config = AutoConfig.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

with init_empty_weights():
    model = AutoModel.from_config(
        config,
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory, no_split_module_classes=no_split_module_classes
)

print("auto determined device_map", device_map)

# Make sure the input and output layers are on the first GPU, to avoid any modifications to the original inference script.
device_map["llm.model.embed_tokens"] = 0
device_map["llm.model.layers.0"] = 0
device_map["llm.lm_head"] = 0
device_map["vpm"] = 0
device_map["resampler"] = 0

print("modified device_map", device_map)
```

You may see output like this:

```
modified device_map OrderedDict([('llm.model.embed_tokens', 0), ('llm.model.layers.0', 0), ('llm.model.layers.1', 0), ('llm.model.layers.2', 0), ('llm.model.layers.3', 0), ('llm.model.layers.4', 0), ('llm.model.layers.5', 0), ('llm.model.layers.6', 0), ('llm.model.layers.7', 0), ('llm.model.layers.8', 0), ('llm.model.layers.9', 0), ('llm.model.layers.10', 0), ('llm.model.layers.11', 0), ('llm.model.layers.12', 0), ('llm.model.layers.13', 0), ('llm.model.layers.14', 0), ('llm.model.layers.15', 0), ('llm.model.layers.16', 1), ('llm.model.layers.17', 1), ('llm.model.layers.18', 1), ('llm.model.layers.19', 1), ('llm.model.layers.20', 1), ('llm.model.layers.21', 1), ('llm.model.layers.22', 1), ('llm.model.layers.23', 1), ('llm.model.layers.24', 1), ('llm.model.layers.25', 1), ('llm.model.layers.26', 1), ('llm.model.layers.27', 1), ('llm.model.layers.28', 1), ('llm.model.layers.29', 1), ('llm.model.layers.30', 1), ('llm.model.layers.31', 1), ('llm.model.norm', 1), ('llm.lm_head', 0), ('vpm', 0), ('resampler', 0)])
```

4. Next, use the `device_map` to dispatch the model layers to the corresponding GPUs.

```python
load_checkpoint_in_model(
    model,
    MODEL_PATH,
    device_map=device_map)

model = dispatch_model(
    model,
    device_map=device_map
)

torch.set_grad_enabled(False)

model.eval()
```

5. Chat!

```python
image_path = '/local/path/to/test.png'

response = model.chat(
    image=Image.open(image_path).convert("RGB"),
    msgs=[
        {
            "role": "user",
            "content": "guess what I am doing?"
        }
    ],
    tokenizer=tokenizer
)

print(response)
```
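For convenience, the balanced `max_memory` mapping built in step 3 can also be computed programmatically for any GPU count. A minimal sketch; the helper name and the weight-budget fraction are our own assumptions, not part of `accelerate`:

```python
# Hypothetical helper: reserve only a fraction of each GPU for weights,
# leaving headroom for intermediate activations as recommended above.
def build_max_memory(gpu_device_ids, total_gib_per_gpu, weight_fraction=0.625):
    budget_gib = int(total_gib_per_gpu * weight_fraction)
    return {device_id: f"{budget_gib}GiB" for device_id in gpu_device_ids}

print(build_max_memory([0, 1], 16))  # {0: '10GiB', 1: '10GiB'}
```

The resulting dict can be passed directly as the `max_memory` argument of `infer_auto_device_map`.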
In this case the OOM (CUDA out of memory) problem may be eliminated. We have tested that:

- it works well with `3000` text input tokens and `1000` text output tokens.
- it works well with a high-resolution input image.

<br/>

### Usage for general cases

This is similar to the previous example, but you may want to modify these two variables:

```python
max_memory_each_gpu = '10GiB'  # Maximum memory to use on each GPU. We suggest a conservative value, because the weights are not everything: intermediate activations also use GPU memory.

gpu_device_ids = [0, 1, ...]  # List the GPUs you want to use
```

You can use the following shell command to monitor memory usage during inference. If there is an OOM, try reducing `max_memory_each_gpu` to make memory pressure more balanced across all GPUs.

```bash
watch -n1 nvidia-smi
```

<br/>

### References

[Ref 1](https://zhuanlan.zhihu.com/p/639850033)
r1-a/response_generation/minicpm/MiniCPM-o/docs/llamafactory_train_and_infer.md
ADDED
@@ -0,0 +1,445 @@
# Best Practice with LLaMA-Factory

## Contents <!-- omit in toc -->

- [Support Models](#Support-Models)
- [LLaMA-Factory Installation](#LLaMA-Factory-Installation)
- [Dataset Prepare](#Dataset-Prepare)
  - [Image Dataset](#Image-Dataset)
  - [Video Dataset](#Video-Dataset)
  - [Audio Dataset](#Audio-Dataset)
- [Lora Fine-Tuning](#Lora-Fine-Tuning)
- [Full Parameters Fine-Tuning](#Full-Parameters-Fine-Tuning)
- [Inference](#Inference)

## Support Models
* [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
* [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)

## LLaMA-Factory Installation

You can install LLaMA-Factory using the commands below.

```
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
mkdir configs # let's put all yaml files here
```

## Dataset Prepare

Refer to [data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) to add your customised dataset. Let's use the three existing demo datasets `mllm_demo`, `mllm_video_demo` and `mllm_audio_demo` as examples (audio is only for MiniCPM-o-2.6).

### Image Dataset

Refer to the image SFT demo data: [data/mllm_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)

<details>
<summary>
<b>data/mllm_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  },
  {
    "messages": [
      {
        "content": "<image>Who is he?",
        "role": "user"
      },
      {
        "content": "He's Thomas Muller from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "Why is he on the ground?",
        "role": "user"
      },
      {
        "content": "Because he's sliding on his knees to celebrate.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/2.jpg"
    ]
  },
  {
    "messages": [
      {
        "content": "<image>Please describe this image",
        "role": "user"
      },
      {
        "content": "Chinese astronaut Gui Haichao is giving a speech.",
        "role": "assistant"
      },
      {
        "content": "What has he accomplished?",
        "role": "user"
      },
      {
        "content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/3.jpg"
    ]
  }
]
```

</details>


### Video Dataset

Refer to the video SFT demo data: [data/mllm_video_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_video_demo.json)

<details>
<summary>
<b>data/mllm_video_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<video>Why is this video funny?",
        "role": "user"
      },
      {
        "content": "Because a baby is reading, and he is so cute!",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ]
  },
  {
    "messages": [
      {
        "content": "<video>What is she doing?",
        "role": "user"
      },
      {
        "content": "She is cooking.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/2.avi"
    ]
  },
  {
    "messages": [
      {
        "content": "<video>What's in the video?",
        "role": "user"
      },
      {
        "content": "A baby is playing in the living room.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/3.mp4"
    ]
  }
]
```

</details>

### Audio Dataset

Refer to the audio SFT demo data: [data/mllm_audio_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_audio_demo.json)

<details>
<summary>
<b>data/mllm_audio_demo.json</b>
</summary>

```json
[
  {
    "messages": [
      {
        "content": "<audio>What's that sound?",
        "role": "user"
      },
      {
        "content": "It is the sound of glass shattering.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/1.mp3"
    ]
  },
  {
    "messages": [
      {
        "content": "<audio>What can you hear?",
        "role": "user"
      },
      {
        "content": "A woman is coughing.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/2.wav"
    ]
  },
  {
    "messages": [
      {
        "content": "<audio>What does the person say?",
        "role": "user"
      },
      {
        "content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.",
        "role": "assistant"
      }
    ],
    "audios": [
      "mllm_demo_data/3.flac"
    ]
  }
]
```

</details>
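Before launching training, it can help to sanity-check that each record's media placeholders match its attached files, since a mismatched `<image>`/`<video>`/`<audio>` count is a common cause of preprocessing errors. A minimal sketch; this validator is our own illustration, not part of LLaMA-Factory:

```python
# Hypothetical validator: every <image>/<video>/<audio> tag in the messages
# should correspond to one entry in the record's images/videos/audios list.
def validate_mllm_records(records):
    errors = []
    for i, rec in enumerate(records):
        text = "".join(m["content"] for m in rec["messages"])
        for tag, key in (("<image>", "images"), ("<video>", "videos"), ("<audio>", "audios")):
            n_tags, n_files = text.count(tag), len(rec.get(key, []))
            if n_tags != n_files:
                errors.append(f"record {i}: {n_tags} {tag} tag(s) vs {n_files} file(s) in '{key}'")
    return errors

good = [{"messages": [{"content": "<image>Who are they?", "role": "user"},
                      {"content": "They're Kane and Gretzka.", "role": "assistant"}],
         "images": ["mllm_demo_data/1.jpg"]}]
print(validate_mllm_records(good))  # []
```

Run it over the parsed JSON before pointing `dataset_info.json` at your file; an empty list means the placeholder counts are consistent.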
| 241 |
+
|
| 242 |
+
## Lora Fine-Tuning
|
| 243 |
+
|
| 244 |
+
We can use one command to do lora sft:
|
| 245 |
+
|
| 246 |
+
```shell
|
| 247 |
+
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/minicpmo_2_6_lora_sft.yaml
|
| 248 |
+
```
|
| 249 |
+
|
| 250 |
+
<details>
|
| 251 |
+
<summary>
|
| 252 |
+
<b>configs/minicpmo_2_6_lora_sft.yaml</b>
|
| 253 |
+
</summary>
|
| 254 |
+
|
| 255 |
+
```yaml
|
| 256 |
+
### model
|
| 257 |
+
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
|
| 258 |
+
trust_remote_code: true
|
| 259 |
+
|
| 260 |
+
### method
|
| 261 |
+
stage: sft
|
| 262 |
+
do_train: true
|
| 263 |
+
finetuning_type: lora
|
| 264 |
+
lora_target: q_proj,v_proj
|
| 265 |
+
|
| 266 |
+
### dataset
|
| 267 |
+
dataset: mllm_demo # mllm_demo mllm_video_demo mllm_audio_demo
|
| 268 |
+
template: minicpm_o # minicpm_o minicpm_v
|
| 269 |
+
cutoff_len: 3072
|
| 270 |
+
max_samples: 1000
|
| 271 |
+
overwrite_cache: true
|
| 272 |
+
preprocessing_num_workers: 16
|
| 273 |
+
|
| 274 |
+
### output
|
| 275 |
+
output_dir: saves/minicpmo_2_6/lora/sft
|
| 276 |
+
logging_steps: 1
|
| 277 |
+
save_steps: 100
|
| 278 |
+
plot_loss: true
|
| 279 |
+
overwrite_output_dir: true
|
| 280 |
+
save_total_limit: 10
|
| 281 |
+
|
| 282 |
+
### train
|
| 283 |
+
per_device_train_batch_size: 2
|
| 284 |
+
gradient_accumulation_steps: 1
|
| 285 |
+
learning_rate: 1.0e-5
|
| 286 |
+
num_train_epochs: 20.0
|
| 287 |
+
lr_scheduler_type: cosine
|
| 288 |
+
warmup_ratio: 0.1
|
| 289 |
+
bf16: true
|
| 290 |
+
ddp_timeout: 180000000
|
| 291 |
+
save_only_model: true
|
| 292 |
+
|
| 293 |
+
### eval
|
| 294 |
+
do_eval: false
|
| 295 |
+
```
|
| 296 |
+
|
| 297 |
+
</details>
|
| 298 |
+
|
| 299 |
+
### Lora Model Export
|
| 300 |
+
|
| 301 |
+
One command to export lora model
|
| 302 |
+
|
| 303 |
+
```shell
|
| 304 |
+
llamafactory-cli export configs/minicpmo_2_6_lora_export.yaml
|
| 305 |
+
```
|
| 306 |
+
|
| 307 |
+
<details>
|
| 308 |
+
<summary>
|
| 309 |
+
<b>configs/minicpmo_2_6_lora_export.yaml</b>
|
| 310 |
+
</summary>
|
| 311 |
+
|
| 312 |
+
```yaml
|
| 313 |
+
### model
|
| 314 |
+
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
|
| 315 |
+
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
|
| 316 |
+
template: minicpm_o # minicpm_o minicpm_v
|
| 317 |
+
finetuning_type: lora
|
| 318 |
+
trust_remote_code: true
|
| 319 |
+
|
| 320 |
+
### export
|
| 321 |
+
export_dir: models/minicpmo_2_6_lora_sft
|
| 322 |
+
export_size: 2
|
| 323 |
+
export_device: cpu
|
| 324 |
+
export_legacy_format: false
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
</details>
|
| 328 |
+
|
| 329 |
+
## Full Parameters Fine-Tuning
|
| 330 |
+
|
| 331 |
+
We can use one command to do full sft:
|
| 332 |
+
|
| 333 |
+
```shell
|
| 334 |
+
llamafactory-cli train configs/minicpmo_2_6_full_sft.yaml
|
| 335 |
+
```
|
| 336 |
+
|
| 337 |
+
<details>
<summary>
<b>configs/minicpmo_2_6_full_sft.yaml</b>
</summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true
freeze_vision_tower: true
print_param_status: true
flash_attn: fa2

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: configs/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpmo_2_6/full/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true

### eval
do_eval: false
```
</details>
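As a quick sanity check on the training settings above, the effective global batch size is the per-device batch size times the gradient-accumulation steps times the number of GPUs. The GPU count below is an assumption for illustration (the config itself does not fix it):

```python
# Effective global batch size implied by the full-SFT config above.
# num_gpus is hypothetical, e.g. one 8-GPU node under the ZeRO-2 DeepSpeed config.
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_gpus = 8

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 16
```

If you change the number of GPUs, scale `gradient_accumulation_steps` accordingly to keep the effective batch size (and thus the learning-rate schedule) comparable.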

## Inference

### Web UI ChatBox

Refer to the [LLaMA-Factory docs](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#inferring-lora-fine-tuned-models) for more inference usage.

For example, we can launch a web chat with a single command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat configs/minicpmo_2_6_infer.yaml
```

<details>
<summary>
<b>configs/minicpmo_2_6_infer.yaml</b>
</summary>

```yaml
model_name_or_path: saves/minicpmo_2_6/full/sft
template: minicpm_o # minicpm_o minicpm_v
infer_backend: huggingface
trust_remote_code: true
```
</details>

### Official Code

You can also run inference with the official code.

<details>
<summary>
<b>official inference code</b>
</summary>

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "saves/minicpmo_2_6/full/sft"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('data/mllm_demo_data/1.jpg').convert('RGB')
question = 'Who are they?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
```

</details>
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_llama3_v2dot5.md
## MiniCPM-Llama3-V 2.5

> Archive at: 2025-01-13

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.**
  MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.

- 💪 **Strong OCR Capabilities.**
  MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

- 🏆 **Trustworthy Behavior.**
  Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).

- 🌏 **Multilingual Support.**
  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).

- 🚀 **Efficient Deployment.**
  MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.

- 💫 **Easy Usage.**
  MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).

### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/MiniCPM-Llama3-V-2.5-peformance.png" width="66%" />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary>
<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th><th>Size</th><th>OCRBench</th><th>TextVQA val</th><th>DocVQA test</th><th>Open-Compass</th><th>MME</th><th>MMB test (en)</th><th>MMB test (cn)</th><th>MMMU val</th><th>Math-Vista</th><th>LLaVA Bench</th><th>RealWorld QA</th><th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr><td colspan="14" align="left"><strong>Proprietary</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Gemini Pro</td><td>-</td><td>680</td><td>74.6</td><td>88.1</td><td>62.9</td><td>2148.9</td><td>73.6</td><td>74.3</td><td>48.9</td><td>45.8</td><td>79.9</td><td>60.4</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td><td>-</td><td>645</td><td>78.0</td><td>88.4</td><td>63.5</td><td>1771.5</td><td>77.0</td><td>74.4</td><td>53.8</td><td>47.8</td><td>93.1</td><td>63.0</td><td>86.4</td></tr>
<tr><td colspan="14" align="left"><strong>Open-source</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Mini-Gemini</td><td>2.2B</td><td>-</td><td>56.2</td><td>34.2*</td><td>-</td><td>1653.0</td><td>-</td><td>-</td><td>31.7</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen-VL-Chat</td><td>9.6B</td><td>488</td><td>61.5</td><td>62.6</td><td>51.6</td><td>1860.0</td><td>61.8</td><td>56.3</td><td>37.0</td><td>33.8</td><td>67.7</td><td>49.3</td><td>56.2</td></tr>
<tr><td nowrap="nowrap" align="left">DeepSeek-VL-7B</td><td>7.3B</td><td>435</td><td>64.7*</td><td>47.0*</td><td>54.6</td><td>1765.4</td><td>73.8</td><td>71.4</td><td>38.3</td><td>36.8</td><td>77.8</td><td>54.2</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Yi-VL-34B</td><td>34B</td><td>290</td><td>43.4*</td><td>16.9*</td><td>52.2</td><td><strong>2050.2</strong></td><td>72.4</td><td>70.7</td><td>45.1</td><td>30.7</td><td>62.3</td><td>54.8</td><td>79.3</td></tr>
<tr><td nowrap="nowrap" align="left">CogVLM-Chat</td><td>17.4B</td><td>590</td><td>70.4</td><td>33.3*</td><td>54.2</td><td>1736.6</td><td>65.8</td><td>55.9</td><td>37.3</td><td>34.7</td><td>73.9</td><td>60.3</td><td>73.6</td></tr>
<tr><td nowrap="nowrap" align="left">TextMonkey</td><td>9.7B</td><td>558</td><td>64.3</td><td>66.7</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Idefics2</td><td>8.0B</td><td>-</td><td>73.0</td><td>74.0</td><td>57.2</td><td>1847.6</td><td>75.7</td><td>68.6</td><td>45.2</td><td>52.2</td><td>49.1</td><td>60.7</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td><td>8.4B</td><td>-</td><td>-</td><td>-</td><td>54.3</td><td>1920.3</td><td>77.0</td><td>73.9</td><td>41.3</td><td>31.5</td><td>61.2</td><td>58.8</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td><td>8.4B</td><td>-</td><td>-</td><td>78.2</td><td>-</td><td>1971.5</td><td>-</td><td>-</td><td>41.7</td><td>37.5</td><td>80.1</td><td>60.0</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td><td>4.2B</td><td>639*</td><td>70.9</td><td>-</td><td>-</td><td>1537.5*</td><td>-</td><td>-</td><td>40.4</td><td>44.5</td><td>64.2*</td><td>58.8*</td><td>-</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 1.0</td><td>2.8B</td><td>366</td><td>60.6</td><td>38.2</td><td>47.5</td><td>1650.2</td><td>64.1</td><td>62.6</td><td>38.3</td><td>28.9</td><td>51.3</td><td>51.2</td><td>78.4</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.0</td><td>2.8B</td><td>605</td><td>74.1</td><td>71.9</td><td>54.5</td><td>1808.6</td><td>69.1</td><td>66.5</td><td>38.2</td><td>38.7</td><td>69.2</td><td>55.8</td><td>85.5</td></tr>
<tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td><td>8.5B</td><td><strong>725</strong></td><td><strong>76.6</strong></td><td><strong>84.8</strong></td><td><strong>65.1</strong></td><td>2024.6</td><td><strong>77.2</strong></td><td><strong>74.2</strong></td><td><strong>45.8</strong></td><td><strong>54.3</strong></td><td><strong>86.7</strong></td><td><strong>63.5</strong></td><td><strong>89.7</strong></td></tr>
</tbody>
</table>

</div>
* We evaluate the officially released checkpoint by ourselves.

</details>

<div align="center">
<img src="../assets/llavabench_compare_3.png" width="100%" />
<br>
Evaluation results of multilingual LLaVA Bench
</div>

### Examples <!-- omit in toc -->

<table align="center">
<p align="center">
<img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>

### Model Zoo

| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v1.md
## MiniCPM-V 1.0

> Archive at: 2024-05-19

MiniCPM-V 1.0 is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Notable features of MiniCPM-V 1.0 include:

- ⚡️ **High Efficiency.**

  MiniCPM-V 1.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. In terms of visual encoding, we compress the image representations into 64 tokens via a perceiver resampler, which is significantly fewer than other LMMs based on MLP architecture (typically > 512 tokens). This allows MiniCPM-V 1.0 to operate with **much less memory cost and higher speed during inference**.

- 🔥 **Promising Performance.**

  MiniCPM-V 1.0 achieves **state-of-the-art performance** on multiple benchmarks (including MMMU, MME, and MMBench, etc.) among models with comparable sizes, surpassing existing LMMs built on Phi-2. It even **achieves comparable or better performance than the 9.6B Qwen-VL-Chat**.

- 🙌 **Bilingual Support.**

  MiniCPM-V 1.0 is **the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese**. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight [paper](https://arxiv.org/abs/2308.12038).

### Evaluation

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th><th>Size</th><th nowrap="nowrap">Visual Tokens</th><th>MME</th><th nowrap="nowrap">MMB dev (en)</th><th nowrap="nowrap">MMB dev (zh)</th><th nowrap="nowrap">MMMU val</th><th nowrap="nowrap">CMMMU val</th>
</tr>
</thead>
<tbody align="center">
<tr><td align="left">LLaVA-Phi</td><td align="right">3B</td><td>576</td><td>1335</td><td>59.8</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">MobileVLM</td><td align="right">3B</td><td>144</td><td>1289</td><td>59.6</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Imp-v1</td><td align="right">3B</td><td>576</td><td>1434</td><td>66.5</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen-VL-Chat</td><td align="right">9.6B</td><td>256</td><td>1487</td><td>60.6</td><td>56.7</td><td>35.9</td><td>30.7</td></tr>
<tr><td nowrap="nowrap" align="left">CogVLM</td><td align="right">17.4B</td><td>1225</td><td>1438</td><td>63.7</td><td>53.8</td><td>32.1</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left"><b>MiniCPM-V 1.0</b></td><td align="right">3B</td><td>64</td><td>1452</td><td>67.9</td><td>65.3</td><td>37.2</td><td>32.1</td></tr>
</tbody>
</table>

</div>

### Examples

We deploy MiniCPM-V 1.0 on end devices. The demo video is a raw, unedited screen recording on a OnePlus 9R.

<table align="center">
<p align="center">
<img src="assets/gif_cases/蛇_cn.gif" width=36%/>
<img src="assets/gif_cases/Mushroom_en.gif" width=36%/>
</p>
</table>

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create a conda environment

```shell
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

## Inference

### Model Zoo
| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
| MiniCPM-V 1.0 | The efficient version for end-device deployment. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |

### Multi-turn Conversation
Please refer to the following code to run `MiniCPM-V 1.0`.

<div align="center">
<img src="assets/worldmap_ck.jpg" width="500px">
</div>

```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/MiniCPM-V')

im_64 = img2base64('./assets/worldmap_ck.jpg')

# First round chat
msgs = [{"role": "user", "content": "What is interesting about this image?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat
# pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is China in the image?"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
```
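The history handling above can be sketched as a small helper (hypothetical, for illustration only; the official `chat.py` does not define it): after each completed turn, the assistant reply is appended before the next user question, so the model always sees the full conversation.

```python
def append_turn(msgs, answer, next_question):
    """Extend a multi-turn history: record the assistant reply, then the next user question."""
    return msgs + [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": next_question},
    ]

msgs = [{"role": "user", "content": "What is interesting about this image?"}]
# "It is a stylized world map." stands in for the model's first-round answer.
msgs = append_turn(msgs, "It is a stylized world map.", "Where is China in the image?")
assert [m["role"] for m in msgs] == ["user", "assistant", "user"]
```

The resulting list is what gets serialized with `json.dumps(msgs)` for the next call to `chat_model.chat`.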


### Inference on Mac
<details>
<summary>Click to view an example: MiniCPM-V 1.0 can run on Macs with MPS (Apple silicon or AMD GPUs). </summary>

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()

image = Image.open('./assets/worldmap_ck.jpg').convert('RGB')
question = 'What is interesting about this image?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)
```
Run with the command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```
</details>

### Deployment on Mobile Phone

Currently, MiniCPM-V 1.0 can be deployed on mobile phones running Android and HarmonyOS. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).
|
r1-a/response_generation/minicpm/MiniCPM-o/docs/minicpm_v2.md
| 1 |
+
## MiniCPM-V 2.0
|
| 2 |
+
|
| 3 |
+
|
| 4 |
+
> Archive at:2025-01-13
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0 has several notable features.
|
| 9 |
+
|
| 10 |
+
- 🔥 **State-of-the-art Performance.**
|
| 11 |
+
|
| 12 |
+
MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.

- 🏆 **Trustworthy Behavior.**

LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.

- 🌟 **High-Resolution Images at Any Aspect Ratio.**

MiniCPM-V 2.0 can accept **1.8 million pixel (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).

- ⚡️ **High Efficiency.**

MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference, even when dealing with high-resolution images**.

- 🙌 **Bilingual Support.**

MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].

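The perceiver-resampler idea above can be sketched in a few lines: a small, fixed set of query vectors cross-attends over a variable number of image patch features, so the language model only ever sees a fixed number of visual tokens regardless of image size. This is an illustrative single-head sketch with random data, not the actual MiniCPM-V module; the real resampler uses trained queries, learned key/value projections, and multiple attention heads.

```python
import numpy as np

def perceiver_resample(image_feats: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Compress N image patch features into len(queries) visual tokens
    via single-head cross-attention: queries attend over the patches."""
    scores = queries @ image_feats.T / np.sqrt(queries.shape[-1])   # (Q, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # softmax over patches
    return weights @ image_feats                                    # (Q, D)

rng = np.random.default_rng(0)
patches = rng.normal(size=(1024, 64))   # 1024 patch features of dim 64
queries = rng.normal(size=(64, 64))     # 64 query vectors -> 64 visual tokens
tokens = perceiver_resample(patches, queries)
print(tokens.shape)  # (64, 64): 1024 patches compressed to 64 tokens
```

The key property is that the output token count depends only on the number of queries, which is what keeps inference cost flat as image resolution grows.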
### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/minicpmv-2-peformance.png" width="66%" />
</div>

<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>
<div align="center">

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th><th>Size</th><th>TextVQA val</th><th>DocVQA test</th><th>OCRBench</th><th>OpenCompass</th><th nowrap="nowrap">MME</th><th>MMB dev (en)</th><th>MMB dev (zh)</th><th>MMMU val</th><th>MathVista</th><th>LLaVA Bench</th><th nowrap="nowrap">Object HalBench</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="13" align="left"><strong>Proprietary models</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Gemini Pro Vision</td><td>-</td><td>74.6</td><td>88.1</td><td>680</td><td>63.8</td><td>2148.9</td><td>75.2</td><td>74.0</td><td>48.9</td><td>45.8</td><td>79.9</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>78.0</td><td>88.4</td><td>645</td><td>63.2</td><td>1771.5</td><td>75.1</td><td>75.0</td><td>53.8</td><td>47.8</td><td>93.1</td><td>86.4 / 92.7</td></tr>
    <tr><td colspan="13" align="left"><strong>Open-source models 6B~34B</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Yi-VL-6B</td><td align="right">6.7B</td><td>45.5*</td><td>17.1*</td><td>290</td><td>49.3</td><td>1915.1</td><td>68.6</td><td>68.3</td><td>40.3</td><td>28.8</td><td>51.9</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Qwen-VL-Chat</td><td align="right">9.6B</td><td>61.5</td><td>62.6</td><td>488</td><td>52.1</td><td>1860.0</td><td>60.6</td><td>56.7</td><td>37.0</td><td>33.8</td><td>67.7</td><td>56.2 / 80.0</td></tr>
    <tr><td nowrap="nowrap" align="left">Yi-VL-34B</td><td align="right">34B</td><td>43.4*</td><td>16.9*</td><td>290</td><td>52.6</td><td>2050.2</td><td>71.1</td><td>71.4</td><td>45.1</td><td>30.7</td><td>62.3</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">DeepSeek-VL-7B</td><td align="right">7.3B</td><td>64.7*</td><td>47.0*</td><td>435</td><td>55.6</td><td>1765.4</td><td>74.1</td><td>72.8</td><td>38.3</td><td>36.8</td><td>77.8</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">TextMonkey</td><td align="right">9.7B</td><td>64.3</td><td>66.7</td><td>558</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM-Chat</td><td align="right">17.4B</td><td>70.4</td><td>33.3*</td><td>590</td><td>52.5</td><td>1736.6</td><td>63.7</td><td>53.8</td><td>37.3</td><td>34.7</td><td>73.9</td><td>73.6 / 87.4</td></tr>
    <tr><td colspan="13" align="left"><strong>Open-source models 1B~3B</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">DeepSeek-VL-1.3B</td><td align="right">1.7B</td><td>58.4*</td><td>37.9*</td><td>413</td><td>46.0</td><td>1531.6</td><td>64.0</td><td>61.2</td><td>33.8</td><td>29.4</td><td>51.1</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">MobileVLM V2</td><td align="right">3.1B</td><td>57.5</td><td>19.4*</td><td>-</td><td>-</td><td>1440.5 (P)</td><td>63.2</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Mini-Gemini</td><td align="right">2.2B</td><td>56.2</td><td>34.2*</td><td>-</td><td>-</td><td>1653.0</td><td>59.8</td><td>-</td><td>31.7</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">MiniCPM-V</td><td align="right">2.8B</td><td>60.6</td><td>38.2</td><td>366</td><td>47.6</td><td>1650.2</td><td>67.9</td><td>65.3</td><td><strong>38.3</strong></td><td>28.9</td><td>51.3</td><td>78.4 / 88.5</td></tr>
    <tr><td nowrap="nowrap" align="left"><strong>MiniCPM-V 2.0</strong></td><td align="right">2.8B</td><td><strong>74.1</strong></td><td><strong>71.9</strong></td><td><strong>605</strong></td><td><strong>55.0</strong></td><td><strong>1808.6</strong></td><td><strong>69.6</strong></td><td><strong>68.1</strong></td><td>38.2</td><td><strong>38.7</strong></td><td><strong>69.2</strong></td><td><strong>85.5 / 92.2</strong></td></tr>
  </tbody>
</table>

</div>
* We evaluate the officially released checkpoint by ourselves.
</details>

### Examples <!-- omit in toc -->

<table align="center">
  <p align="center">
    <img src="../assets/minicpmv2-cases_2.png" width=95%/>
  </p>
</table>

We deploy MiniCPM-V 2.0 on end devices. The demo video below is a raw screen recording on a Xiaomi 14 Pro, without any editing.

<table align="center">
  <p align="center">
    <img src="../assets/gif_cases/station.gif" width=36%/>
    <img src="../assets/gif_cases/london_car.gif" width=36%/>
  </p>
</table>

### Model Zoo

| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |

### Deployment on Mobile Phone

MiniCPM-V 2.0 can be deployed on mobile phones running Android. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK.

## MiniCPM-V 2.6

> Archived at: 2025-01-13

**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.

- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.

- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME both with and without subtitles.

- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.

- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as the iPad.

- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).

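For the video-input capability above, a common preprocessing step is to uniformly subsample frames before encoding, so that long videos stay within a fixed visual-token budget. A minimal sketch (the `max_frames` budget of 64 is an illustrative assumption, not a documented limit of the model):

```python
def sample_frame_indices(num_frames: int, max_frames: int = 64) -> list[int]:
    """Uniformly sample at most max_frames frame indices from a video."""
    if num_frames <= max_frames:
        return list(range(num_frames))     # short clip: keep every frame
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

print(sample_frame_indices(10))            # short clip: all 10 indices
print(len(sample_frame_indices(9000)))     # long video: capped at 64 frames
```

The sampled frames would then be passed to the model as an ordered image sequence.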
### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/radar_final.png" width="66%" />
</div>

<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
<div align="center">

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th><th>Size</th><th>Token Density<sup>+</sup></th><th>OpenCompass</th><th>MME</th><th>MMVet</th><th>OCRBench</th><th>MMMU val</th><th>MathVista mini</th><th>MMB1.1 test</th><th>AI2D</th><th>TextVQA val</th><th>DocVQA test</th><th>HallusionBench</th><th>Object HalBench</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="15" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4o</td><td>-</td><td>1088</td><td>69.9</td><td>2328.7</td><td>69.1</td><td>736</td><td>69.2</td><td>61.3</td><td>82.2</td><td>84.6</td><td>-</td><td>92.8</td><td>55.0</td><td>17.6</td></tr>
    <tr><td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td><td>-</td><td>750</td><td>67.9</td><td>1920.0</td><td>66.0</td><td>788</td><td>65.9</td><td>61.6</td><td>78.5</td><td>80.2</td><td>-</td><td>95.2</td><td>49.9</td><td>13.8</td></tr>
    <tr><td nowrap="nowrap" align="left">Gemini 1.5 Pro</td><td>-</td><td>-</td><td>64.4</td><td>2110.6</td><td>64.0</td><td>754</td><td>60.6</td><td>57.7</td><td>73.9</td><td>79.1</td><td>73.5</td><td>86.5</td><td>45.6</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4o mini</td><td>-</td><td>1088</td><td>64.1</td><td>2003.4</td><td>66.9</td><td>785</td><td>60.0</td><td>52.4</td><td>76.0</td><td>77.8</td><td>-</td><td>-</td><td>46.1</td><td>12.4</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>1088</td><td>63.5</td><td>2070.2</td><td>67.5</td><td>656</td><td>61.7</td><td>54.7</td><td>79.8</td><td>78.6</td><td>78.0</td><td>87.2</td><td>43.9</td><td>14.2</td></tr>
    <tr><td nowrap="nowrap" align="left">Step-1V</td><td>-</td><td>-</td><td>59.5</td><td>2206.4</td><td>63.3</td><td>625</td><td>49.9</td><td>44.8</td><td>78.0</td><td>79.2</td><td>71.6</td><td>-</td><td>48.4</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Qwen-VL-Max</td><td>-</td><td>784</td><td>58.3</td><td>2281.7</td><td>61.8</td><td>684</td><td>52.0</td><td>43.4</td><td>74.6</td><td>75.7</td><td>79.5</td><td>93.1</td><td>41.2</td><td>13.4</td></tr>
    <tr><td colspan="15" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td><td>34B</td><td>157</td><td>55.0</td><td>2006.5</td><td>50.7</td><td>574</td><td>48.8</td><td>40.4</td><td>77.8</td><td>78.9</td><td>69.3</td><td>-</td><td>34.8</td><td>12.6</td></tr>
    <tr><td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td><td>34B</td><td>157</td><td>-</td><td>2141.0</td><td>59.3</td><td>518</td><td>48.0</td><td>43.3</td><td>-</td><td>80.5</td><td>74.1</td><td>78.9</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Cambrian-34B</td><td>34B</td><td>1820</td><td>58.3</td><td>2049.9</td><td>53.2</td><td>591</td><td>50.4</td><td>50.3</td><td>77.8</td><td>79.5</td><td>76.7</td><td>75.5</td><td>41.6</td><td>14.7</td></tr>
    <tr><td nowrap="nowrap" align="left">GLM-4V-9B</td><td>13B</td><td>784</td><td>59.1</td><td>2018.8</td><td>58.0</td><td>776</td><td>46.9</td><td>51.1</td><td>67.9</td><td>71.2</td><td>-</td><td>-</td><td>45.0</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>706</td><td>64.1</td><td>2215.1</td><td>54.3</td><td>794</td><td><strong>51.2</strong></td><td>58.3</td><td><strong>79.4</strong></td><td><strong>83.6</strong></td><td>77.4</td><td><strong>91.6</strong></td><td>45.0</td><td>21.3</td></tr>
    <tr><td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td><td>8B</td><td>1882</td><td>58.8</td><td>2024.6</td><td>52.8</td><td>725</td><td>45.8</td><td>54.3</td><td>72.0</td><td>78.4</td><td>76.6</td><td>84.8</td><td>42.4</td><td>10.3</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>2822</strong></td><td><strong>65.2</strong></td><td><strong>2348.4</strong>*</td><td><strong>60.0</strong></td><td><strong>852</strong>*</td><td>49.8*</td><td><strong>60.6</strong></td><td>78.0</td><td>82.1</td><td><strong>80.1</strong></td><td>90.8</td><td><strong>48.1</strong>*</td><td><strong>8.2</strong></td></tr>
  </tbody>
</table>

</div>
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.

<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.

</details>

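The token-density figures in the table above follow directly from the definition (pixels at maximum resolution divided by the number of visual tokens), and MiniCPM-V 2.6's entry can be reproduced from the numbers stated in this document:

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Token density = # pixels at maximum resolution / # visual tokens."""
    return width * height / num_visual_tokens

# MiniCPM-V 2.6: a 1344x1344 (~1.8M pixel) image is encoded into 640 visual tokens.
print(round(token_density(1344, 1344, 640)))  # -> 2822, matching the table
```

A higher token density means more image content per visual token, which is why it correlates with the inference-cost advantages described earlier.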
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th><th>Size</th><th>Mantis Eval</th><th>BLINK val</th><th>Mathverse mv</th><th>Sciverse mv</th><th>MIRB</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="7" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>62.7</td><td>54.6</td><td>60.3</td><td>66.9</td><td>53.1</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td><td>14B</td><td>66.4</td><td>52.6</td><td>32.7</td><td>30.2</td><td>-</td></tr>
    <tr><td colspan="7" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Emu2-Chat</td><td>37B</td><td>37.8</td><td>36.2</td><td>-</td><td>27.2</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM</td><td>17B</td><td>45.2</td><td>41.1</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">VPG-C</td><td>7B</td><td>52.4</td><td>43.1</td><td>24.3</td><td>23.1</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">VILA 8B</td><td>8B</td><td>51.2</td><td>39.3</td><td>-</td><td>36.5</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td><td>8B</td><td>53.1*</td><td>48.9</td><td>32.1*</td><td>-</td><td>42.5</td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>59.0*</td><td>50.9</td><td>30.5*</td><td>34.4*</td><td><strong>56.9*</strong></td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>69.1</strong></td><td><strong>53.0</strong></td><td><strong>84.9</strong></td><td><strong>74.9</strong></td><td>53.8</td></tr>
  </tbody>
</table>

</div>
* We evaluate the officially released checkpoint by ourselves.
</details>

<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th align="left">Model</th><th>Size</th><th colspan="2">Video-MME</th><th colspan="5">Video-ChatGPT</th>
    </tr>
    <tr>
      <th align="left"></th><th></th><th>w/o subs</th><th>w subs</th><th>Correctness</th><th>Detail</th><th>Context</th><th>Temporal</th><th>Consistency</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="9" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td><td>-</td><td>60.0</td><td>62.9</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>59.9</td><td>63.3</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td colspan="9" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td><td>7B</td><td>-</td><td>-</td><td>3.39</td><td>3.29</td><td>3.92</td><td>2.60</td><td>3.12</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td><td>34B</td><td>-</td><td>-</td><td>3.29</td><td>3.23</td><td>3.83</td><td>2.51</td><td>3.47</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM2-Video</td><td>12B</td><td>-</td><td>-</td><td>3.49</td><td><strong>3.46</strong></td><td>3.23</td><td><strong>2.98</strong></td><td><strong>3.64</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">LongVA</td><td>7B</td><td>52.4</td><td>54.3</td><td>3.05</td><td>3.09</td><td>3.77</td><td>2.44</td><td><strong>3.64</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">InternVL2-8B</td><td>8B</td><td>54.0</td><td>56.9</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td><td>8B</td><td>55.8</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td><td>32B</td><td>60.2</td><td>63.0</td><td>3.48</td><td>3.37</td><td><strong>3.95</strong></td><td>2.64</td><td>3.28</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.6</td><td>8B</td><td><strong>60.9</strong></td><td><strong>63.6</strong></td>
|
| 557 |
+
<td><strong>3.59</strong></td>
|
| 558 |
+
<td>3.28</td>
|
| 559 |
+
<td>3.93</td>
|
| 560 |
+
<td>2.73</td>
|
| 561 |
+
<td>3.62</td>
|
| 562 |
+
</tr>
|
| 563 |
+
</tbody>
|
| 564 |
+
</table>
|
| 565 |
+
</div>
|
| 566 |
+
</details>
|
| 567 |
+
|
| 568 |
+
|
| 569 |
+
<details>
|
| 570 |
+
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
|
| 571 |
+
<div align="center">
|
| 572 |
+
<table style="margin: 0px auto;">
|
| 573 |
+
<thead>
|
| 574 |
+
<tr>
|
| 575 |
+
<th align="left">Model</th>
|
| 576 |
+
<th>Size</th>
|
| 577 |
+
<th>Shot</th>
|
| 578 |
+
<th>TextVQA val</th>
|
| 579 |
+
<th>VizWiz test-dev</th>
|
| 580 |
+
<th>VQAv2 test-dev</th>
|
| 581 |
+
<th>OK-VQA val</th>
|
| 582 |
+
</tr>
|
| 583 |
+
</thead>
|
| 584 |
+
<tbody align="center">
|
| 585 |
+
<tr>
|
| 586 |
+
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
|
| 587 |
+
<td rowspan="3">80B</td>
|
| 588 |
+
<td>0*</td>
|
| 589 |
+
<td>35.0</td>
|
| 590 |
+
<td>31.6</td>
|
| 591 |
+
<td>56.3</td>
|
| 592 |
+
<td>40.6</td>
|
| 593 |
+
</tr>
|
| 594 |
+
<tr>
|
| 595 |
+
<td>4</td>
|
| 596 |
+
<td>36.5</td>
|
| 597 |
+
<td>39.6</td>
|
| 598 |
+
<td>63.1</td>
|
| 599 |
+
<td><strong>57.4</strong></td>
|
| 600 |
+
</tr>
|
| 601 |
+
<tr>
|
| 602 |
+
<td>8</td>
|
| 603 |
+
<td>37.3</td>
|
| 604 |
+
<td>44.8</td>
|
| 605 |
+
<td>65.6</td>
|
| 606 |
+
<td>57.5</td>
|
| 607 |
+
</tr>
|
| 608 |
+
<tr>
|
| 609 |
+
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
|
| 610 |
+
<td rowspan="3">80B</td>
|
| 611 |
+
<td>0*</td>
|
| 612 |
+
<td>30.9</td>
|
| 613 |
+
<td>36.0</td>
|
| 614 |
+
<td>60.0</td>
|
| 615 |
+
<td>45.2</td>
|
| 616 |
+
</tr>
|
| 617 |
+
<tr>
|
| 618 |
+
<td>4</td>
|
| 619 |
+
<td>34.3</td>
|
| 620 |
+
<td>40.4</td>
|
| 621 |
+
<td>63.6</td>
|
| 622 |
+
<td>52.4</td>
|
| 623 |
+
</tr>
|
| 624 |
+
<tr>
|
| 625 |
+
<td>8</td>
|
| 626 |
+
<td>35.7</td>
|
| 627 |
+
<td>46.1</td>
|
| 628 |
+
<td>64.8</td>
|
| 629 |
+
<td>55.1</td>
|
| 630 |
+
</tr>
|
| 631 |
+
<tr>
|
| 632 |
+
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
|
| 633 |
+
<td rowspan="3">7B</td>
|
| 634 |
+
<td>0*</td>
|
| 635 |
+
<td>43.0</td>
|
| 636 |
+
<td>49.8</td>
|
| 637 |
+
<td>63.2</td>
|
| 638 |
+
<td>45.5</td>
|
| 639 |
+
</tr>
|
| 640 |
+
<tr>
|
| 641 |
+
<td>4</td>
|
| 642 |
+
<td>45.4</td>
|
| 643 |
+
<td>51.3</td>
|
| 644 |
+
<td>64.5</td>
|
| 645 |
+
<td>46.5</td>
|
| 646 |
+
</tr>
|
| 647 |
+
<tr>
|
| 648 |
+
<td>8</td>
|
| 649 |
+
<td>45.6</td>
|
| 650 |
+
<td>52.2</td>
|
| 651 |
+
<td>64.7</td>
|
| 652 |
+
<td>46.6</td>
|
| 653 |
+
</tr>
|
| 654 |
+
<tr>
|
| 655 |
+
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
|
| 656 |
+
<td rowspan="3">37B</td>
|
| 657 |
+
<td>0</td>
|
| 658 |
+
<td>26.4</td>
|
| 659 |
+
<td>40.4</td>
|
| 660 |
+
<td>33.5</td>
|
| 661 |
+
<td>26.7</td>
|
| 662 |
+
</tr>
|
| 663 |
+
<tr>
|
| 664 |
+
<td>4</td>
|
| 665 |
+
<td>48.2</td>
|
| 666 |
+
<td>54.6</td>
|
| 667 |
+
<td>67.0</td>
|
| 668 |
+
<td>53.2</td>
|
| 669 |
+
</tr>
|
| 670 |
+
<tr>
|
| 671 |
+
<td>8</td>
|
| 672 |
+
<td>49.3</td>
|
| 673 |
+
<td>54.7</td>
|
| 674 |
+
<td>67.8</td>
|
| 675 |
+
<td>54.1</td>
|
| 676 |
+
</tr>
|
| 677 |
+
<tr>
|
| 678 |
+
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
|
| 679 |
+
<td rowspan="2">30B</td>
|
| 680 |
+
<td>0</td>
|
| 681 |
+
<td>26.2</td>
|
| 682 |
+
<td>40.4</td>
|
| 683 |
+
<td>48.9</td>
|
| 684 |
+
<td>26.7</td>
|
| 685 |
+
</tr>
|
| 686 |
+
<tr>
|
| 687 |
+
<td>8</td>
|
| 688 |
+
<td>49.3</td>
|
| 689 |
+
<td>54.7</td>
|
| 690 |
+
<td><strong>70.9</strong></td>
|
| 691 |
+
<td>54.1</td>
|
| 692 |
+
</tr>
|
| 693 |
+
<tr style="background-color: #e6f2ff;">
|
| 694 |
+
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
|
| 695 |
+
<td rowspan="3">8B</td>
|
| 696 |
+
<td>0</td>
|
| 697 |
+
<td>43.9</td>
|
| 698 |
+
<td>33.8</td>
|
| 699 |
+
<td>45.4</td>
|
| 700 |
+
<td>23.9</td>
|
| 701 |
+
</tr>
|
| 702 |
+
<tr style="background-color: #e6f2ff;">
|
| 703 |
+
<td>4</td>
|
| 704 |
+
<td>63.6</td>
|
| 705 |
+
<td>60.5</td>
|
| 706 |
+
<td>65.5</td>
|
| 707 |
+
<td>50.1</td>
|
| 708 |
+
</tr>
|
| 709 |
+
<tr style="background-color: #e6f2ff;">
|
| 710 |
+
<td>8</td>
|
| 711 |
+
<td><strong>64.6</strong></td>
|
| 712 |
+
<td><strong>63.4</strong></td>
|
| 713 |
+
<td>68.2</td>
|
| 714 |
+
<td>51.4</td>
|
| 715 |
+
</tr>
|
| 716 |
+
</tbody>
|
| 717 |
+
</table>
|
| 718 |
+
|
| 719 |
+
|
| 720 |
+
</div>
|
| 721 |
+
* denotes zero image shot and two additional text shots following Flamingo.
|
| 722 |
+
|
| 723 |
+
<sup>+</sup> We evaluate the pretraining ckpt without SFT.
|
| 724 |
+
</details>
|
| 725 |
+
|
| 726 |
+
### Examples <!-- omit in toc -->
|
| 727 |
+
|
| 728 |
+
<div style="display: flex; flex-direction: column; align-items: center;">
|
| 729 |
+
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
|
| 730 |
+
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
|
| 731 |
+
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
|
| 732 |
+
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
|
| 733 |
+
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
|
| 734 |
+
</div>
|
| 735 |
+
<details>
|
| 736 |
+
<summary>Click to view more cases.</summary>
|
| 737 |
+
<div style="display: flex; flex-direction: column; align-items: center;">
|
| 738 |
+
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
|
| 739 |
+
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
|
| 740 |
+
</div>
|
| 741 |
+
</details>
|
| 742 |
+
|
| 743 |
+
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro, without editing.
|
| 744 |
+
|
| 745 |
+
<table align="center">
|
| 746 |
+
<p align="center">
|
| 747 |
+
<img src="../assets/gif_cases/ai.gif" width=32%/>
|
| 748 |
+
|
| 749 |
+
<img src="../assets/gif_cases/beer.gif" width=32%/>
|
| 750 |
+
</p>
|
| 751 |
+
</table>
|
| 752 |
+
|
| 753 |
+
<table align="center">
|
| 754 |
+
<p align="center">
|
| 755 |
+
<img src="../assets/gif_cases/ticket.gif" width=32%/>
|
| 756 |
+
|
| 757 |
+
<img src="../assets/gif_cases/wfh.gif" width=32%/>
|
| 758 |
+
</p>
|
| 759 |
+
</table>
|
| 760 |
+
|
| 761 |
+
<table align="center">
|
| 762 |
+
<p align="center">
|
| 763 |
+
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
|
| 764 |
+
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
|
| 765 |
+
</p>
|
| 766 |
+
</table>
|
| 767 |
+
|
| 768 |
+
</details>
|
| 769 |
+
|
| 770 |
+
|
| 771 |
+
|
| 772 |
+
### Multi-turn Conversation
|
| 773 |
+
|
| 774 |
+
|
| 775 |
+
<div align="center">
|
| 776 |
+
<img src="../assets/airplane.jpeg" width="500px">
|
| 777 |
+
</div>
|
| 778 |
+
|
| 779 |
+
|
| 780 |
+
```python
|
| 781 |
+
import torch
|
| 782 |
+
from PIL import Image
|
| 783 |
+
from transformers import AutoModel, AutoTokenizer
|
| 784 |
+
|
| 785 |
+
torch.manual_seed(0)
|
| 786 |
+
|
| 787 |
+
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
| 788 |
+
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
| 789 |
+
model = model.eval().cuda()
|
| 790 |
+
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
| 791 |
+
|
| 792 |
+
image = Image.open('./assets/airplane.jpeg').convert('RGB')
|
| 793 |
+
|
| 794 |
+
# First round chat
|
| 795 |
+
question = "Tell me the model of this aircraft."
|
| 796 |
+
msgs = [{'role': 'user', 'content': [image, question]}]
|
| 797 |
+
|
| 798 |
+
answer = model.chat(
|
| 799 |
+
image=None,
|
| 800 |
+
msgs=msgs,
|
| 801 |
+
tokenizer=tokenizer
|
| 802 |
+
)
|
| 803 |
+
print(answer)
|
| 804 |
+
|
| 805 |
+
# Second round chat
|
| 806 |
+
# pass history context of multi-turn conversation
|
| 807 |
+
msgs.append({"role": "assistant", "content": [answer]})
|
| 808 |
+
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
|
| 809 |
+
|
| 810 |
+
answer = model.chat(
|
| 811 |
+
image=None,
|
| 812 |
+
msgs=msgs,
|
| 813 |
+
tokenizer=tokenizer
|
| 814 |
+
)
|
| 815 |
+
print(answer)
|
| 816 |
+
```
|
| 817 |
+
|
| 818 |
+
You should get output similar to the following:
|
| 819 |
+
|
| 820 |
+
```
|
| 821 |
+
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
|
| 822 |
+
|
| 823 |
+
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
|
| 824 |
+
```
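As the example above shows, the multi-turn history is just a list of role/content dicts: each round appends the assistant reply and then the next user turn. A minimal plain-data sketch (no model call; the `'<image>'` string stands in for the PIL image object):

```python
# Sketch of the history format consumed by model.chat:
# the conversation grows by appending an assistant turn, then a user turn.
msgs = [{'role': 'user', 'content': ['<image>', 'Tell me the model of this aircraft.']}]

answer = 'The aircraft in the image is an Airbus A380 ...'  # first-round reply
msgs.append({'role': 'assistant', 'content': [answer]})
msgs.append({'role': 'user', 'content': ['Introduce something about Airbus A380.']})

roles = [m['role'] for m in msgs]
print(roles)  # ['user', 'assistant', 'user']
```

The second call to `model.chat` then receives the full list, so the model sees the image and its own earlier answer as context.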
|
| 825 |
+
|
| 826 |
+
#### Multi-image Understanding
|
| 827 |
+
<details>
|
| 828 |
+
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
|
| 829 |
+
|
| 830 |
+
```python
|
| 831 |
+
import torch
|
| 832 |
+
from PIL import Image
|
| 833 |
+
from transformers import AutoModel, AutoTokenizer
|
| 834 |
+
|
| 835 |
+
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
| 836 |
+
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
| 837 |
+
model = model.eval().cuda()
|
| 838 |
+
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
| 839 |
+
|
| 840 |
+
image1 = Image.open('image1.jpg').convert('RGB')
|
| 841 |
+
image2 = Image.open('image2.jpg').convert('RGB')
|
| 842 |
+
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
|
| 843 |
+
|
| 844 |
+
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
|
| 845 |
+
|
| 846 |
+
answer = model.chat(
|
| 847 |
+
image=None,
|
| 848 |
+
msgs=msgs,
|
| 849 |
+
tokenizer=tokenizer
|
| 850 |
+
)
|
| 851 |
+
print(answer)
|
| 852 |
+
```
|
| 853 |
+
</details>
|
| 854 |
+
|
| 855 |
+
#### Few-shot In-Context-Learning
|
| 856 |
+
|
| 857 |
+
<details>
|
| 858 |
+
<summary> Click to view a Python example of MiniCPM-V 2.6 few-shot in-context learning </summary>
|
| 859 |
+
|
| 860 |
+
```python
|
| 861 |
+
import torch
|
| 862 |
+
from PIL import Image
|
| 863 |
+
from transformers import AutoModel, AutoTokenizer
|
| 864 |
+
|
| 865 |
+
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
| 866 |
+
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
| 867 |
+
model = model.eval().cuda()
|
| 868 |
+
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
| 869 |
+
|
| 870 |
+
question = "production date"
|
| 871 |
+
image1 = Image.open('example1.jpg').convert('RGB')
|
| 872 |
+
answer1 = "2023.08.04"
|
| 873 |
+
image2 = Image.open('example2.jpg').convert('RGB')
|
| 874 |
+
answer2 = "2007.04.24"
|
| 875 |
+
image_test = Image.open('test.jpg').convert('RGB')
|
| 876 |
+
|
| 877 |
+
msgs = [
|
| 878 |
+
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
|
| 879 |
+
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
|
| 880 |
+
{'role': 'user', 'content': [image_test, question]}
|
| 881 |
+
]
|
| 882 |
+
|
| 883 |
+
answer = model.chat(
|
| 884 |
+
image=None,
|
| 885 |
+
msgs=msgs,
|
| 886 |
+
tokenizer=tokenizer
|
| 887 |
+
)
|
| 888 |
+
print(answer)
|
| 889 |
+
```
|
| 890 |
+
</details>
|
| 891 |
+
|
| 892 |
+
#### Video understanding
|
| 893 |
+
<details>
|
| 894 |
+
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
|
| 895 |
+
|
| 896 |
+
```python
|
| 897 |
+
import torch
|
| 898 |
+
from PIL import Image
|
| 899 |
+
from transformers import AutoModel, AutoTokenizer
|
| 900 |
+
from decord import VideoReader, cpu # pip install decord
|
| 901 |
+
|
| 902 |
+
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
| 903 |
+
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
| 904 |
+
model = model.eval().cuda()
|
| 905 |
+
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
| 906 |
+
|
| 907 |
+
MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number
|
| 908 |
+
|
| 909 |
+
def encode_video(video_path):
|
| 910 |
+
def uniform_sample(l, n):
|
| 911 |
+
gap = len(l) / n
|
| 912 |
+
idxs = [int(i * gap + gap / 2) for i in range(n)]
|
| 913 |
+
return [l[i] for i in idxs]
|
| 914 |
+
|
| 915 |
+
vr = VideoReader(video_path, ctx=cpu(0))
|
| 916 |
+
    sample_fps = round(vr.get_avg_fps() / 1)  # sample one frame per second
|
| 917 |
+
frame_idx = [i for i in range(0, len(vr), sample_fps)]
|
| 918 |
+
if len(frame_idx) > MAX_NUM_FRAMES:
|
| 919 |
+
frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
|
| 920 |
+
frames = vr.get_batch(frame_idx).asnumpy()
|
| 921 |
+
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
|
| 922 |
+
print('num frames:', len(frames))
|
| 923 |
+
return frames
|
| 924 |
+
|
| 925 |
+
video_path="video_test.mp4"
|
| 926 |
+
frames = encode_video(video_path)
|
| 927 |
+
question = "Describe the video"
|
| 928 |
+
msgs = [
|
| 929 |
+
{'role': 'user', 'content': frames + [question]},
|
| 930 |
+
]
|
| 931 |
+
|
| 932 |
+
# Set decode params for video
|
| 933 |
+
params = {}
|
| 934 |
+
params["use_image_id"] = False
|
| 935 |
+
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution is larger than 448*448
|
| 936 |
+
|
| 937 |
+
answer = model.chat(
|
| 938 |
+
image=None,
|
| 939 |
+
msgs=msgs,
|
| 940 |
+
tokenizer=tokenizer,
|
| 941 |
+
**params
|
| 942 |
+
)
|
| 943 |
+
print(answer)
|
| 944 |
+
```
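The `uniform_sample` helper above is pure Python and can be checked in isolation. For example, a 300-frame index list (a 5-minute clip sampled at 1 fps) is thinned to at most `MAX_NUM_FRAMES` evenly spaced frames:

```python
# Standalone check of the frame-thinning logic used in encode_video.
MAX_NUM_FRAMES = 64

def uniform_sample(l, n):
    # pick n indices spread evenly over l, offset to the middle of each gap
    gap = len(l) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [l[i] for i in idxs]

frame_idx = list(range(300))  # e.g. a 5-minute video at 1 fps
if len(frame_idx) > MAX_NUM_FRAMES:
    frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)

print(len(frame_idx))  # 64
print(frame_idx[:3])   # [2, 7, 11]
```

The sampled indices stay sorted and evenly spaced, so temporal order is preserved while bounding the number of image tokens fed to the model.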
|
| 945 |
+
</details>
|
r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm.md
ADDED
|
@@ -0,0 +1,183 @@
|
| 1 |
+
## OmniLMM-12B
|
| 2 |
+
|
| 3 |
+
> OmniLMM-12B was released in the early days of this project. We recommend using our [recently released models](./README_zh.md) for better performance and efficiency.
|
| 4 |
+
|
| 5 |
+
> Archived at: 2024-05-19
|
| 6 |
+
|
| 7 |
+
**OmniLMM-12B** is the most capable version in the current series. The model is built on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler, and trained on multimodal data in a curriculum fashion. It has three notable features:
|
| 8 |
+
|
| 9 |
+
- 🔥 **Leading Performance.**
|
| 10 |
+
|
| 11 |
+
  OmniLMM-12B achieves **leading performance** among models of comparable size on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.), and possesses rich multimodal world knowledge.
|
| 12 |
+
|
| 13 |
+
- 🏆 **Trustworthy Behavior.**
|
| 14 |
+
|
| 15 |
+
  Hallucination is a widely noted problem of multimodal LLMs: models often generate text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first open-source multimodal LLM with strong overall capability aligned via multimodal RLHF** (using the [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series of techniques). It achieves **the best level among open-source models** on the [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) hallucination benchmark, and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
|
| 16 |
+
|
| 17 |
+
- 🕹 **Real-time Multimodal Interaction.**
|
| 18 |
+
|
| 19 |
+
  We combine OmniLMM-12B with GPT-3.5 (text-only) to build a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and handles speech input and output with external tools. While still preliminary, we find that the model can **replicate some of the fun cases in the Gemini demo video without any video editing**.
|
| 20 |
+
|
| 21 |
+
### Evaluation <!-- omit in toc -->
|
| 22 |
+
|
| 23 |
+
<div align="center">
|
| 24 |
+
<img src=assets/radar_omnilmm12b.png width=66% />
|
| 25 |
+
</div>
|
| 26 |
+
<details>
|
| 27 |
+
<summary> Click to view detailed results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench W, MathVista. </summary>
|
| 28 |
+
|
| 29 |
+
<table>
|
| 30 |
+
<thead>
|
| 31 |
+
<tr>
|
| 32 |
+
<th align="left">Model</th>
|
| 33 |
+
<th>Size</th>
|
| 34 |
+
<th>MME</th>
|
| 35 |
+
<th nowrap="nowrap">MMB dev (en)</th>
|
| 36 |
+
<th nowrap="nowrap" >MMMU val</th>
|
| 37 |
+
<th nowrap="nowrap" >MMHal-Bench</th>
|
| 38 |
+
<th nowrap="nowrap" >Object HalBench</th>
|
| 39 |
+
<th nowrap="nowrap" >SeedBench-I</th>
|
| 40 |
+
<th>MathVista</th>
|
| 41 |
+
<th nowrap="nowrap" >LLaVA Bench</th>
|
| 42 |
+
</tr>
|
| 43 |
+
</thead>
|
| 44 |
+
<tbody align="center">
|
| 45 |
+
<tr>
|
| 46 |
+
<td align="left">GPT-4V†</td>
|
| 47 |
+
<td>-</td>
|
| 48 |
+
<td>1771.5</td>
|
| 49 |
+
<td>75.1 </td>
|
| 50 |
+
<td>56.8</td>
|
| 51 |
+
<td>3.53 / 70.8</td>
|
| 52 |
+
<td>86.4 / 92.7</td>
|
| 53 |
+
<td>71.6 </td>
|
| 54 |
+
<td>47.8 </td>
|
| 55 |
+
<td>93.1 </td>
|
| 56 |
+
</tr>
|
| 57 |
+
<tr>
|
| 58 |
+
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
|
| 59 |
+
<td>-</td>
|
| 60 |
+
<td>2183.4</td>
|
| 61 |
+
<td>66.2 </td>
|
| 62 |
+
<td>45.2</td>
|
| 63 |
+
<td>- </td>
|
| 64 |
+
<td>- </td>
|
| 65 |
+
<td>65.7 </td>
|
| 66 |
+
<td>36.0 </td>
|
| 67 |
+
<td>73.7 </td>
|
| 68 |
+
</tr>
|
| 69 |
+
<tr>
|
| 70 |
+
<td align="left">Yi-VL 6B</td>
|
| 71 |
+
<td align="right">6.7B </td>
|
| 72 |
+
<td>1915.1 </td>
|
| 73 |
+
<td>68.6 </td>
|
| 74 |
+
<td>40.3 </td>
|
| 75 |
+
<td>- </td>
|
| 76 |
+
<td>- </td>
|
| 77 |
+
<td>67.5 </td>
|
| 78 |
+
<td>28.8 </td>
|
| 79 |
+
<td>51.9 </td>
|
| 80 |
+
</tr>
|
| 81 |
+
<tr>
|
| 82 |
+
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
|
| 83 |
+
<td align="right">9.6B</td>
|
| 84 |
+
<td>1860.0</td>
|
| 85 |
+
<td>60.6 </td>
|
| 86 |
+
<td>35.9</td>
|
| 87 |
+
<td>2.93 / 59.4</td>
|
| 88 |
+
<td>56.2 / 80.0</td>
|
| 89 |
+
<td>64.8 </td>
|
| 90 |
+
<td>33.8 </td>
|
| 91 |
+
<td>67.7 </td>
|
| 92 |
+
</tr>
|
| 93 |
+
<tr>
|
| 94 |
+
<td align="left" >CogVLM-Chat</td>
|
| 95 |
+
<td align="right">17.4B</td>
|
| 96 |
+
<td>1736.6</td>
|
| 97 |
+
<td>63.7 </td>
|
| 98 |
+
<td>32.1 </td>
|
| 99 |
+
<td>2.68 / 52.1 </td>
|
| 100 |
+
<td>73.6 / 87.4 </td>
|
| 101 |
+
<td>68.8 </td>
|
| 102 |
+
<td>34.7 </td>
|
| 103 |
+
<td>73.9 </td>
|
| 104 |
+
</tr>
|
| 105 |
+
<tr>
|
| 106 |
+
<td align="left" >LLaVA 1.5</td>
|
| 107 |
+
<td align="right">13.6B </td>
|
| 108 |
+
<td>1808.4 </td>
|
| 109 |
+
<td>68.2 </td>
|
| 110 |
+
<td>36.4 </td>
|
| 111 |
+
<td>2.71 / 51.0 </td>
|
| 112 |
+
<td>53.7 / 77.4 </td>
|
| 113 |
+
<td>68.1 </td>
|
| 114 |
+
<td>26.4 </td>
|
| 115 |
+
<td>64.6 </td>
|
| 116 |
+
</tr>
|
| 117 |
+
<tr>
|
| 118 |
+
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
|
| 119 |
+
<td align="right">11.6B </td>
|
| 120 |
+
<td>1935.8 </td>
|
| 121 |
+
<td>71.6 </td>
|
| 122 |
+
<td>40.7 </td>
|
| 123 |
+
<td>3.45 / 68.8 </td>
|
| 124 |
+
<td>90.3 / 95.5 </td>
|
| 125 |
+
<td>71.1 </td>
|
| 126 |
+
<td>34.9 </td>
|
| 127 |
+
<td>72.0 </td>
|
| 128 |
+
</tr>
|
| 129 |
+
</tbody>
|
| 130 |
+
</table>
|
| 131 |
+
<small>†: Proprietary models</small>
|
| 132 |
+
<br>
|
| 133 |
+
</details>
|
| 134 |
+
|
| 135 |
+
### Examples <!-- omit in toc -->
|
| 136 |
+
|
| 137 |
+
<table align="center" >
|
| 138 |
+
<p align="center" >
|
| 139 |
+
<img src="assets/omnilmm-12b-examples_2.png" />
|
| 140 |
+
</p>
|
| 141 |
+
</table>
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
We combine OmniLMM-12B with ChatGPT-3.5 (text-only) to build a **real-time multimodal interactive assistant**. OmniLMM-12B converts video frames into image descriptions, which ChatGPT-3.5 uses to generate responses to user instructions. The demo video is a raw recording without editing.
|
| 145 |
+
|
| 146 |
+
<div align="center" >
|
| 147 |
+
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/8fec13bf-bb47-4bf8-8f8c-d0b716a964ec" type="video/mp4" width=80%/>
|
| 148 |
+
</div>
|
| 149 |
+
|
| 150 |
+
## Online Demo
|
| 151 |
+
|
| 152 |
+
You are welcome to try our web-based inference service: [OmniLMM-12B](http://120.92.209.146:8081) | [MiniCPM-V 2.0](http://120.92.209.146:80).
|
| 153 |
+
|
| 154 |
+
## Installation
|
| 155 |
+
|
| 156 |
+
1. Clone this repository and navigate to its directory
|
| 157 |
+
|
| 158 |
+
```bash
|
| 159 |
+
git clone https://github.com/OpenBMB/MiniCPM-V.git
|
| 160 |
+
cd MiniCPM-V
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
2. Create a conda environment
|
| 164 |
+
|
| 165 |
+
```Shell
|
| 166 |
+
conda create -n MiniCPMV python=3.10 -y
|
| 167 |
+
conda activate MiniCPMV
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
3. Install dependencies
|
| 171 |
+
|
| 172 |
+
```shell
|
| 173 |
+
pip install -r requirements.txt
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
## Inference
|
| 177 |
+
|
| 178 |
+
### Model Zoo
|
| 179 |
+
|
| 180 |
+
| Model | Description | Download Link |
|
| 181 |
+
|:----------------------|:-------------------|:---------------:|
|
| 182 |
+
| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
|
| 183 |
+
|
r1-a/response_generation/minicpm/MiniCPM-o/docs/omnilmm_en.md
ADDED
|
@@ -0,0 +1,155 @@
|
|
| 1 |
+
## OmniLMM-12B
|
| 2 |
+
|
| 3 |
+
> OmniLMM-12B was released early in this project. We recommend using our [recently released models](./README.md) for better performance and efficiency.
|
| 4 |
+
|
| 5 |
+
> Archived at: 2024-05-19
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
**OmniLMM-12B** is the most capable version. The model is built on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
|
| 9 |
+
|
| 10 |
+
- 🔥 **Strong Performance.**
|
| 11 |
+
|
| 12 |
+
  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model also possesses rich multimodal world knowledge.
|
| 13 |
+
|
| 14 |
+
- 🏆 **Trustworthy Behavior.**
|
| 15 |
+
|
| 16 |
+
  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing non-existing objects in images). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
|
| 17 |
+
|
| 18 |
+
- 🕹 **Real-time Multimodal Interaction.**
|
| 19 |
+
|
| 20 |
+
  We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
### Evaluation <!-- omit in toc -->
|
| 24 |
+
<div align="center">
|
| 25 |
+
<img src=assets/radar_omnilmm12b.png width=66% />
|
| 26 |
+
</div>
|
| 27 |
+
<details>
|
| 28 |
+
<summary>Click to view results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista. </summary>
|
| 29 |
+
|
| 30 |
+
<table>
|
| 31 |
+
<thead>
|
| 32 |
+
<tr>
|
| 33 |
+
<th align="left">Model</th>
|
| 34 |
+
<th>Size</th>
|
| 35 |
+
<th>MME</th>
|
| 36 |
+
<th nowrap="nowrap">MMB dev (en)</th>
|
| 37 |
+
<th nowrap="nowrap" >MMMU val</th>
|
| 38 |
+
<th nowrap="nowrap" >MMHal-Bench</th>
|
| 39 |
+
<th nowrap="nowrap" >Object HalBench</th>
|
| 40 |
+
<th nowrap="nowrap" >SeedBench-I</th>
|
| 41 |
+
<th>MathVista</th>
|
| 42 |
+
<th nowrap="nowrap" >LLaVA Bench</th>
|
| 43 |
+
</tr>
|
| 44 |
+
</thead>
|
| 45 |
+
<tbody align="center">
|
| 46 |
+
<tr>
|
| 47 |
+
<td align="left">GPT-4V†</td>
|
| 48 |
+
<td>-</td>
|
| 49 |
+
<td>1771.5</td>
|
| 50 |
+
<td>75.1 </td>
|
| 51 |
+
<td>56.8</td>
|
| 52 |
+
<td>3.53 / 70.8</td>
|
| 53 |
+
<td>86.4 / 92.7</td>
|
| 54 |
+
<td>71.6 </td>
|
| 55 |
+
<td>47.8 </td>
|
| 56 |
+
<td>93.1 </td>
|
| 57 |
+
</tr>
|
| 58 |
+
<tr>
|
| 59 |
+
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
|
| 60 |
+
<td>-</td>
|
| 61 |
+
<td>2183.4</td>
|
| 62 |
+
<td>66.2 </td>
|
| 63 |
+
<td>45.2</td>
|
| 64 |
+
<td>- </td>
|
| 65 |
+
<td>- </td>
|
| 66 |
+
<td>65.7 </td>
|
| 67 |
+
<td>36.0 </td>
|
| 68 |
+
<td>73.7 </td>
|
| 69 |
+
</tr>
|
| 70 |
+
<tr>
|
| 71 |
+
<td align="left">Yi-VL 6B</td>
|
| 72 |
+
<td align="right">6.7B </td>
|
| 73 |
+
<td>1915.1 </td>
|
| 74 |
+
<td>68.6 </td>
|
| 75 |
+
<td>40.3 </td>
|
| 76 |
+
<td>- </td>
|
| 77 |
+
<td>- </td>
|
| 78 |
+
<td>67.5 </td>
|
| 79 |
+
<td>28.8 </td>
|
| 80 |
+
<td>51.9 </td>
|
| 81 |
+
</tr>
|
| 82 |
+
<tr>
|
| 83 |
+
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
|
| 84 |
+
<td align="right">9.6B</td>
|
| 85 |
+
<td>1860.0</td>
|
| 86 |
+
<td>60.6 </td>
|
| 87 |
+
<td>35.9</td>
|
| 88 |
+
<td>2.93 / 59.4</td>
|
| 89 |
+
<td>56.2 / 80.0</td>
|
| 90 |
+
<td>64.8 </td>
|
| 91 |
+
<td>33.8 </td>
|
| 92 |
+
<td>67.7 </td>
|
| 93 |
+
</tr>
|
| 94 |
+
<tr>
|
| 95 |
+
<td align="left" >CogVLM-Chat</td>
|
| 96 |
+
<td align="right">17.4B</td>
|
| 97 |
+
<td>1736.6</td>
|
| 98 |
+
<td>63.7 </td>
|
| 99 |
+
<td>32.1 </td>
|
| 100 |
+
<td>2.68 / 52.1 </td>
|
| 101 |
+
<td>73.6 / 87.4 </td>
|
| 102 |
+
<td>68.8 </td>
|
| 103 |
+
<td>34.7 </td>
|
| 104 |
+
<td>73.9 </td>
|
| 105 |
+
</tr>
|
| 106 |
+
<tr>
|
| 107 |
+
<td align="left" >LLaVA 1.5</td>
|
| 108 |
+
<td align="right">13.6B </td>
|
| 109 |
+
<td>1808.4 </td>
|
| 110 |
+
<td>68.2 </td>
|
| 111 |
+
<td>36.4 </td>
|
| 112 |
+
<td>2.71 / 51.0 </td>
|
| 113 |
+
<td>53.7 / 77.4 </td>
|
| 114 |
+
<td>68.1 </td>
|
| 115 |
+
<td>26.4 </td>
|
| 116 |
+
<td>64.6 </td>
|
| 117 |
+
</tr>
|
| 118 |
+
<tr>
|
| 119 |
+
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
|
| 120 |
+
<td align="right">11.6B </td>
|
| 121 |
+
<td>1935.8 </td>
|
| 122 |
+
<td>71.6 </td>
|
| 123 |
+
<td>40.7 </td>
|
| 124 |
+
<td>3.45 / 68.8 </td>
|
| 125 |
+
<td>90.3 / 95.5 </td>
|
| 126 |
+
<td>71.1 </td>
|
| 127 |
+
<td>34.9 </td>
|
| 128 |
+
<td>72.0 </td>
|
| 129 |
+
</tr>
|
| 130 |
+
</tbody>
|
| 131 |
+
</table>
|
| 132 |
+
<small>†: Proprietary models</small>
|
| 133 |
+
<br>
|
| 134 |
+
</details>
|
| 135 |
+
|
| 136 |
+
### Examples <!-- omit in toc -->
|
| 137 |
+
|
| 138 |
+
<table align="center" >
|
| 139 |
+
<p align="center" >
|
| 140 |
+
<img src="assets/omnilmm-12b-examples_2.png" />
|
| 141 |
+
</p>
|
| 142 |
+
</table>
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. Video frames are described in text by OmniLMM-12B, and ChatGPT-3.5 (text-only) generates responses based on the descriptions and user prompts. The demo video is a raw recording without editing.
|
| 146 |
+
|
| 147 |
+
<div align="center" >
|
| 148 |
+
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/485a8f52-fb4d-4eca-8fee-506347efcfc6" type="video/mp4" width=80%/>
|
| 149 |
+
</div>
|
| 150 |
+
|
| 151 |
+
### Model Zoo
|
| 152 |
+
|
| 153 |
+
| Model | Description | Download Link |
|
| 154 |
+
|:----------------------|:-------------------|:---------------:|
|
| 155 |
+
| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
|
r1-a/response_generation/minicpm/MiniCPM-o/docs/swift_train_and_infer.md
ADDED
|
@@ -0,0 +1,135 @@
|
## SWIFT install
You can quickly install SWIFT from source with the following commands.

```bash
git clone https://github.com/modelscope/swift.git
cd swift
pip install -r requirements.txt
pip install -e '.[llm]'
```

## SWIFT Infer
Inference with SWIFT can be run in two ways: through the command-line interface or via Python code.

### Quick start
Here are the steps to launch SWIFT inference from the command line:

1. Running the following command will download MiniCPM-Llama3-V-2_5 and start inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
```

2. You can also pass additional arguments to control inference:
```
model_id_or_path # Hugging Face model ID or local path to the model
infer_backend ['AUTO', 'vllm', 'pt'] # Inference backend, default is AUTO
dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # Computation precision
max_length # Maximum sequence length
max_new_tokens: int = 2048 # Maximum number of tokens to generate
do_sample: bool = True # Whether to sample during generation
temperature: float = 0.3 # Sampling temperature
top_k: int = 20
top_p: float = 0.7
repetition_penalty: float = 1. # Repetition penalty
num_beams: int = 1 # Number of beams for beam search
stop_words: List[str] = None # List of stop words
quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] # Model quantization method
quantization_bit [0, 1, 2, 3, 4, 8] # Default is 0, meaning no quantization
```
3. Example:
```shell
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --model_type minicpm-v-v2_5-chat \
    --model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
    --dtype bf16
```
### Python code with SWIFT infer
The following demonstrates using Python code to run inference with the MiniCPM-Llama3-V-2_5 model through SWIFT.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # Select the GPUs to use

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)  # Import the necessary modules

from swift.utils import seed_everything  # For setting the random seed
import torch

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)  # Obtain the template type, used for constructing special tokens and the image processing workflow
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
                                       model_kwargs={'device_map': 'auto'})  # Load the model: model type, model path, device allocation, computation precision, etc.
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)  # Construct the template from the template type
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']  # Image URL
query = '距离各城市多远?'  # "How far is it to each city?"
response, history = inference(model, template, query, images=images)  # Run inference
print(f'query: {query}')
print(f'response: {response}')

# Streaming output
query = '距离最远的城市是哪?'  # "Which city is the farthest away?"
gen = inference_stream(model, template, query, history, images=images)  # Call the streaming inference interface
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
```

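The incremental printing pattern in the streaming loop above (tracking `print_idx` so only the newly generated suffix is printed) can be exercised without loading a model. A minimal sketch with a fake, cumulative token stream standing in for `inference_stream`:

```python
def stream_print(chunks):
    """Collect only the new suffix of a growing response string,
    mirroring the print_idx bookkeeping in the loop above."""
    printed = []
    print_idx = 0
    response = ''
    for response in chunks:       # each chunk is the full response so far
        delta = response[print_idx:]
        printed.append(delta)     # in real code: print(delta, end='', flush=True)
        print_idx = len(response)
    return ''.join(printed), response

# The cumulative chunks reassemble into exactly one copy of the final text.
full, last = stream_print(['He', 'Hello', 'Hello, world'])
assert full == 'Hello, world' and last == 'Hello, world'
```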
## SWIFT train
SWIFT supports training on a local dataset. The training steps are as follows:
1. Prepare the training data in the following format:
```jsonl
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
```
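A dataset in this format can be generated programmatically. Below is a minimal sketch (the sample contents and the `swift_train.jsonl` file name are illustrative, not part of SWIFT's API); it writes one JSON object per line and reads them back to verify the required keys are present:

```python
import json
import os
import tempfile

# Illustrative training samples following the JSONL format shown above.
samples = [
    {"query": "What does this picture describe?",
     "response": "This picture has a giant panda.",
     "images": ["panda.jpg"]},
    {"query": "Is bamboo tasty?",
     "response": "It seems pretty tasty judging by the panda's expression.",
     "history": [["What's in this picture?", "There's a giant panda in this picture."]],
     "images": ["panda.jpg"]},
]

def write_jsonl(path, rows):
    # One JSON object per line; UTF-8 so non-ASCII queries survive round-tripping.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

path = os.path.join(tempfile.gettempdir(), "swift_train.jsonl")
write_jsonl(path, samples)

# Read back and check every record carries the required keys.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all({"query", "response", "images"} <= set(row) for row in rows)
```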
2. LoRA Tuning:

By default, the LoRA targets are the k and v projection weights in the LLM. Pay attention to `eval_steps`: SWIFT may run out of memory during evaluation, so set `eval_steps` to a very large value (e.g. 200000) to effectively skip it.
```shell
# Experimental environment: A100
# 32GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2_5-chat \
    --dataset coco-en-2-mini
```
3. Full-parameter fine-tuning:

When `lora_target_modules` is set to `ALL`, all of the model's parameters are fine-tuned.
```shell
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type minicpm-v-v2_5-chat \
    --dataset coco-en-2-mini \
    --lora_target_modules ALL \
    --eval_steps 200000
```

## LoRA Merge and Infer
The LoRA weights can be merged into the base model and then loaded for inference.

1. To load the LoRA weights for inference, run the following:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir /your/lora/save/checkpoint
```
2. Merge the LoRA weights into the base model:

This command loads the LoRA weights, merges them into the base model, saves the merged model to the LoRA checkpoint path, and loads the merged model for inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir your/lora/save/checkpoint \
    --merge_lora true
```