arashkermani committed
Commit 8988ef6 · verified · 1 Parent(s): 0bad3bb

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +1434 -164
  2. pytorch_model.bin +2 -2
README.md CHANGED
@@ -1,231 +1,1501 @@
  ---
- license: apache-2.0
  library_name: transformers
  tags:
- - vision
- - image-text-to-text
- - multimodal
- - test-model
- - tiny-model
- - openvino
- - optimum-intel
- pipeline_tag: image-text-to-text
  ---

- # Tiny Random MiniCPM-o-2_6

- ## Model Description

- This is a **tiny, randomly initialized version** of the [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) multimodal vision-language model, designed specifically for **testing and CI/CD purposes** in the [optimum-intel](https://github.com/huggingface/optimum-intel) library.

- **⚠️ Important**: This model has randomly initialized weights and is NOT intended for actual inference. It is designed solely for:
- - Testing model loading and export functionality
- - CI/CD pipeline validation
- - OpenVINO conversion testing
- - Quantization workflow testing

- ## Model Specifications

- - **Architecture**: MiniCPM-o-2_6 (multimodal: vision + text + audio + TTS)
- - **Parameters**: 17,390,468 (~17.4M parameters)
- - **Model Binary Size**: 66.45 MB
- - **Total Repository Size**: ~82 MB
- - **Original Model**: [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) (~18 GB)
- - **Size Reduction**: 219× smaller than the full model
- - **OpenVINO Export**: ✅ Fully supported
- - **All Components Enabled**: vision, audio, and TTS modules initialized
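As a quick sanity check of the figures above (my own arithmetic, assuming 4-byte float32 weights and the ~18 GB / ~82 MB sizes quoted in the list):

```python
# Sanity check of the size figures above (assumes float32, 4 bytes/parameter).
num_params = 17_390_468
weights_mib = num_params * 4 / 2**20   # expected pytorch_model.bin payload
print(f"{weights_mib:.2f} MiB")        # ~66.3 MiB, consistent with "66.45 MB"

reduction = 18_000 / 82                # ~18 GB original vs. ~82 MB repository
print(f"~{int(reduction)}x smaller")   # consistent with the quoted 219x
```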

- ## Architecture Details

- ### Language Model (LLM) Component
- - `num_hidden_layers`: 2 (reduced from 40)
- - `hidden_size`: 256 (reduced from 2048)
- - `intermediate_size`: 512 (reduced from 8192)
- - `num_attention_heads`: 4 (reduced from 32)
- - `vocab_size`: 320 (reduced from 151,700)
- - `max_position_embeddings`: 128 (reduced from 8192)

- ### Vision Component (SigLIP-based)
- - `hidden_size`: 8
- - `num_hidden_layers`: 1

- ### Audio Component (Whisper-based)
- - `d_model`: 64
- - `encoder_layers`: 1
- - `decoder_layers`: 1

- ### TTS Component
- - `hidden_size`: 8
- - `num_layers`: 1

- All architectural components are present and properly initialized, ensuring full compatibility with OpenVINO export and testing workflows.
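To see how the reduced LLM dimensions above translate into parameter count, here is a rough back-of-the-envelope estimate (my own sketch, not from the card; it assumes a SwiGLU-style MLP and ignores norms, biases, and the vision/audio/TTS towers):

```python
# Rough parameter estimate for the reduced LLM component above.
# Assumption: SwiGLU-style MLP (gate/up/down); norms and biases ignored.
cfg = {"num_hidden_layers": 2, "hidden_size": 256,
       "intermediate_size": 512, "vocab_size": 320}

def approx_llm_params(c):
    attn = 4 * c["hidden_size"] ** 2                     # q/k/v/o projections
    mlp = 3 * c["hidden_size"] * c["intermediate_size"]  # gate/up/down projections
    embed = c["vocab_size"] * c["hidden_size"]           # token embeddings
    return c["num_hidden_layers"] * (attn + mlp) + embed

print(approx_llm_params(cfg))  # 1392640, i.e. ~1.4M for the text tower alone
```

This is roughly the ~1.48M figure quoted later for v1 (which had the multimodal components disabled); the vision, audio, and TTS modules account for the rest of the ~17.4M total.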
  ## Usage

- ### Loading with Transformers

  ```python
- from transformers import AutoModelForCausalLM, AutoProcessor
- import torch

- model_id = "arashkermani/tiny-random-MiniCPM-o-2_6"

- # Load the model
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
      trust_remote_code=True,
-     torch_dtype=torch.float32,
-     device_map="cpu"
  )

- # Load the processor
- processor = AutoProcessor.from_pretrained(
-     model_id,
-     trust_remote_code=True
  )

- # Test a forward pass
- input_ids = torch.randint(0, 320, (1, 5))
- position_ids = torch.arange(5).unsqueeze(0)

- data = {
-     "input_ids": input_ids,
-     "pixel_values": [[]],
-     "tgt_sizes": [[]],
-     "image_bound": [[]],
-     "position_ids": position_ids,
- }

- with torch.no_grad():
-     outputs = model(data=data)

- print(f"Logits shape: {outputs.logits.shape}")  # (1, 5, 320)
  ```

- ### Using with Optimum-Intel (OpenVINO)

  ```python
- from optimum.intel.openvino import OVModelForVisualCausalLM
- from transformers import AutoProcessor

- model_id = "arashkermani/tiny-random-MiniCPM-o-2_6"

- # Export the model to OpenVINO and load it
- model = OVModelForVisualCausalLM.from_pretrained(
-     model_id,
-     trust_remote_code=True,
-     export=True
  )

- processor = AutoProcessor.from_pretrained(
-     model_id,
-     trust_remote_code=True
  )
  ```

- ### Export to OpenVINO

- ```bash
- optimum-cli export openvino \
-     -m arashkermani/tiny-random-MiniCPM-o-2_6 \
-     minicpm-o-openvino \
-     --task=image-text-to-text \
-     --trust-remote-code
  ```

- ## Intended Use

- This model is intended **exclusively** for:
- - ✅ Testing optimum-intel OpenVINO export functionality
- - ✅ CI/CD pipeline validation
- - ✅ Model loading and compatibility testing
- - ✅ Quantization workflow testing
- - ✅ Fast prototyping and debugging

- **Not intended for**:
- - ❌ Production inference
- - ❌ Actual image-text-to-text tasks
- - ❌ Model quality evaluation
- - ❌ Benchmarking performance metrics

- ## Training Details

- This model was generated by:
- 1. Loading the config from `openbmb/MiniCPM-o-2_6`
- 2. Reducing all dimensions to minimal viable values
- 3. Initializing weights randomly using `AutoModelForCausalLM.from_config()`
- 4. Keeping all components (vision, audio, TTS) enabled for full compatibility
- 5. Copying all necessary tokenizer, processor, and custom code files

- **No training was performed**: all weights are randomly initialized.

- ## Validation Results

- The model has been validated to ensure it:
- - ✅ Loads with `trust_remote_code=True`
- - ✅ Is compatible with transformers AutoModel APIs
- - ✅ Supports a forward pass with the expected input format
- - ✅ **Is compatible with OpenVINO export via optimum-intel**
- - ✅ Includes all required custom modules and artifacts
- - ✅ Has all multimodal components (vision/audio/TTS) properly initialized

- ## Comparison with Previous Versions

- | Metric | v1 (components disabled) | v2 (this version) |
- |--------|--------------------------|-------------------|
- | Parameters | 1.48M | 17.4M |
- | Total Size | 21 MB | 82 MB |
- | OpenVINO Export | ❌ Not supported | ✅ Fully supported |
- | Vision Module | ❌ Disabled | ✅ Enabled |
- | Audio Module | ❌ Disabled | ✅ Enabled |
- | TTS Module | ❌ Disabled | ✅ Enabled |

- **Recommendation**: Use this version for full test coverage, including OpenVINO export tests.

- ## Files Included

- - `config.json` - Model configuration
- - `pytorch_model.bin` - Model weights (66.45 MB)
- - `generation_config.json` - Generation parameters
- - `preprocessor_config.json` - Preprocessor configuration
- - `processor_config.json` - Processor configuration
- - `tokenizer_config.json` - Tokenizer configuration
- - `tokenizer.json` - Fast tokenizer
- - `vocab.json` - Vocabulary
- - `merges.txt` - BPE merges
- - Custom Python modules:
-   - `modeling_minicpmo.py`
-   - `configuration_minicpm.py`
-   - `processing_minicpmo.py`
-   - `image_processing_minicpmv.py`
-   - `tokenization_minicpmo_fast.py`
-   - `modeling_navit_siglip.py`
-   - `resampler.py`
-   - `utils.py`

- ## Related Models

- - Original model: [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
- - Previous test model: [optimum-intel-internal-testing/tiny-random-MiniCPM-o-2_6](https://huggingface.co/optimum-intel-internal-testing/tiny-random-MiniCPM-o-2_6)

- ## License

- This model follows the same license as the original MiniCPM-o-2_6 model (Apache 2.0).

- ## Citation

- If you use this test model in your CI/CD or testing infrastructure, please reference:

- ```bibtex
- @misc{tiny-minicpm-o-2_6,
-   author = {Arash Kermani},
-   title = {Tiny Random MiniCPM-o-2_6 for Testing},
-   year = {2026},
-   publisher = {HuggingFace},
-   howpublished = {\url{https://huggingface.co/arashkermani/tiny-random-MiniCPM-o-2_6}}
- }
  ```

- ## Contact

- For issues or questions about this test model, please open an issue in the [optimum-intel repository](https://github.com/huggingface/optimum-intel/issues).
  ---
+ pipeline_tag: any-to-any
+ datasets:
+ - openbmb/RLAIF-V-Dataset
  library_name: transformers
+ language:
+ - multilingual
  tags:
+ - minicpm-o
+ - omni
+ - vision
+ - ocr
+ - multi-image
+ - video
+ - custom_code
+ - audio
+ - speech
+ - voice cloning
+ - live streaming
+ - realtime speech conversation
+ - asr
+ - tts
+ license: apache-2.0
  ---

+ <h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>
+
+ [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn) | [Technical Blog](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) | [Join Us](https://mp.weixin.qq.com/mp/wappoc_appmsgcaptcha?poc_token=HAV8UWijqB3ImPSXecZHlOns7NRgpQw9y9EI2_fE&target_url=https%3A%2F%2Fmp.weixin.qq.com%2Fs%2FKIhH2nCURBXuFXAtYRpuXg%3F)
+
+ ### News
+
+ * [2025.06.20] ⭐️⭐️⭐️ Our official [ollama repository](https://ollama.com/openbmb) is released. Try our latest models with [one click](https://ollama.com/openbmb/minicpm-o2.6)!
+
+ * [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, is accepted by CVPR 2025! The [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), and [paper](https://arxiv.org/abs/2405.17220) are open-sourced!
+
+ * [2025.01.24] 📢📢📢 The MiniCPM-o 2.6 technical report is released! [See here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9).
+
+ * [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
+
+ ## MiniCPM-o 2.6
+
+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6 and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
+
+ - 🔥 **Leading Visual Capability.**
+ MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single-image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
+
+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
+
+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independently of user queries, and supports real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance among open-source models on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
+
+ - 💪 **Strong OCR Capability and Others.**
+ Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
+
+ - 🚀 **Superior Efficiency.**
+ In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
+
+ - 💫 **Easy Usage.**
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) an online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
+
+ **Model Architecture.**
+
+ - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
+
+ <div align="center">
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework-v2.png" width="100%">
+ </div>
+
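The TDM mechanism described above can be illustrated very loosely: parallel modality streams are chopped into small periodic slices and serialized into one sequence for the LLM backbone. This is my own sketch with hypothetical stream objects, not the model's actual implementation:

```python
# Illustrative only: serialize parallel modality streams into periodic
# time slices, the rough idea behind time-division multiplexing (TDM).
def tdm_interleave(streams, slice_len):
    """streams: dict name -> list of per-step items (e.g. frames, audio chunks)."""
    n = max(len(items) for items in streams.values())
    out = []
    for t0 in range(0, n, slice_len):            # one periodic time slice
        for name, items in streams.items():      # each stream contributes in turn
            out.extend((name, x) for x in items[t0:t0 + slice_len])
    return out

mixed = tdm_interleave({"video": [0, 1, 2, 3], "audio": ["a", "b", "c", "d"]}, 2)
print(mixed[:4])  # [('video', 0), ('video', 1), ('audio', 'a'), ('audio', 'b')]
```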
76
+
77
+ ### Evaluation <!-- omit in toc -->
78
+
79
+ <div align="center">
80
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/radar.jpg" width=90% />
81
+ </div>
82
+
83
+ #### Visual understanding results
84
+
85
+ **Image Understanding:**
86
+
87
+ <div align="center">
88
+ <table style="margin: 0px auto;">
89
+ <thead>
90
+ <tr>
91
+ <th align="left">Model</th>
92
+ <th>Size</th>
93
+ <th>Token Density<sup>+</sup></th>
94
+ <th>OpenCompass</th>
95
+ <th>OCRBench</th>
96
+ <th>MathVista mini</th>
97
+ <th>ChartQA</th>
98
+ <th>MMVet</th>
99
+ <th>MMStar</th>
100
+ <th>MME</th>
101
+ <th>MMB1.1 test</th>
102
+ <th>AI2D</th>
103
+ <th>MMMU val</th>
104
+ <th>HallusionBench</th>
105
+ <th>TextVQA val</th>
106
+ <th>DocVQA test</th>
107
+ <th>MathVerse mini</th>
108
+ <th>MathVision</th>
109
+ <th>MMHal Score</th>
110
+ </tr>
111
+ </thead>
112
+ <tbody align="center">
113
+ <tr>
114
+ <td colspan="19" align="left"><strong>Proprietary</strong></td>
115
+ </tr>
116
+ <tr>
117
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
118
+ <td>-</td>
119
+ <td>1088</td>
120
+ <td><u>69.9</u></td>
121
+ <td>736</td>
122
+ <td>61.3</td>
123
+ <td>85.7</td>
124
+ <td><strong>69.1</strong></td>
125
+ <td>63.9</td>
126
+ <td>2328.7</td>
127
+ <td>82.2</td>
128
+ <td>84.6</td>
129
+ <td><strong>69.2</strong></td>
130
+ <td><strong>55.0</strong></td>
131
+ <td>-</td>
132
+ <td>92.8</td>
133
+ <td><strong>50.2</strong></td>
134
+ <td><strong>30.4</strong></td>
135
+ <td><u>3.6</u></td>
136
+ </tr>
137
+ <tr>
138
+ <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
139
+ <td>-</td>
140
+ <td>750</td>
141
+ <td>67.9</td>
142
+ <td>788</td>
143
+ <td>61.6</td>
144
+ <td><strong>90.8</strong></td>
145
+ <td>66.0</td>
146
+ <td>62.2</td>
147
+ <td>1920.0</td>
148
+ <td>78.5</td>
149
+ <td>80.2</td>
150
+ <td><u>65.9</u></td>
151
+ <td>49.9</td>
152
+ <td>-</td>
153
+ <td><strong>95.2</strong></td>
154
+ <td>-</td>
155
+ <td>-</td>
156
+ <td>3.4</td>
157
+ </tr>
158
+ <tr>
159
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
160
+ <td>-</td>
161
+ <td>-</td>
162
+ <td>64.4</td>
163
+ <td>754</td>
164
+ <td>57.7</td>
165
+ <td>81.3</td>
166
+ <td>64.0</td>
167
+ <td>59.1</td>
168
+ <td>2110.6</td>
169
+ <td>73.9</td>
170
+ <td>79.1</td>
171
+ <td>60.6</td>
172
+ <td>45.6</td>
173
+ <td>73.5</td>
174
+ <td>86.5</td>
175
+ <td>-</td>
176
+ <td>19.2</td>
177
+ <td>-</td>
178
+ </tr>
179
+ <tr>
180
+ <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
181
+ <td>-</td>
182
+ <td>1088</td>
183
+ <td>64.1</td>
184
+ <td>785</td>
185
+ <td>52.4</td>
186
+ <td>-</td>
187
+ <td>66.9</td>
188
+ <td>54.8</td>
189
+ <td>2003.4</td>
190
+ <td>76.0</td>
191
+ <td>77.8</td>
192
+ <td>60.0</td>
193
+ <td>46.1</td>
194
+ <td>-</td>
195
+ <td>-</td>
196
+ <td>-</td>
197
+ <td>-</td>
198
+ <td>3.3</td>
199
+ </tr>
200
+ <tr>
201
+ <td colspan="19" align="left"><strong>Open Source</strong></td>
202
+ </tr>
203
+ <tr>
204
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
205
+ <td>34B</td>
206
+ <td><u>1820</u></td>
207
+ <td>58.3</td>
208
+ <td>591</td>
209
+ <td>50.3</td>
210
+ <td>75.6</td>
211
+ <td>53.2</td>
212
+ <td>54.2</td>
213
+ <td>2049.9</td>
214
+ <td>77.8</td>
215
+ <td>79.5</td>
216
+ <td>50.4</td>
217
+ <td>41.6</td>
218
+ <td>76.7</td>
219
+ <td>75.5</td>
220
+ <td>-</td>
221
+ <td>-</td>
222
+ <td>-</td>
223
+ </tr>
224
+ <tr>
225
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
226
+ <td>13B</td>
227
+ <td>784</td>
228
+ <td>59.1</td>
229
+ <td>776</td>
230
+ <td>51.1</td>
231
+ <td>-</td>
232
+ <td>58.0</td>
233
+ <td>54.8</td>
234
+ <td>2018.8</td>
235
+ <td>67.9</td>
236
+ <td>71.2</td>
237
+ <td>46.9</td>
238
+ <td>45.0</td>
239
+ <td>-</td>
240
+ <td>-</td>
241
+ <td>-</td>
242
+ <td>-</td>
243
+ <td>-</td>
244
+ </tr>
245
+ <tr>
246
+ <td nowrap="nowrap" align="left">Pixtral-12B</td>
247
+ <td>12B</td>
248
+ <td>256</td>
249
+ <td>61.0</td>
250
+ <td>685</td>
251
+ <td>56.9</td>
252
+ <td>81.8</td>
253
+ <td>58.5</td>
254
+ <td>54.5</td>
255
+ <td>-</td>
256
+ <td>72.7</td>
257
+ <td>79.0</td>
258
+ <td>51.1</td>
259
+ <td>47.0</td>
260
+ <td>75.7</td>
261
+ <td>90.7</td>
262
+ <td>-</td>
263
+ <td>-</td>
264
+ <td>-</td>
265
+ </tr>
266
+ <tr>
267
+ <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
268
+ <td>27B</td>
269
+ <td>672</td>
270
+ <td>66.4</td>
271
+ <td>809</td>
272
+ <td>63.9</td>
273
+ <td>86.0</td>
274
+ <td>60.0</td>
275
+ <td>61.9</td>
276
+ <td>2253.0</td>
277
+ <td>81.2</td>
278
+ <td>83.8</td>
279
+ <td>54.0</td>
280
+ <td>45.3</td>
281
+ <td><u>84.2</u></td>
282
+ <td>93.3</td>
283
+ <td>-</td>
284
+ <td>-</td>
285
+ <td>3.0</td>
286
+ </tr>
287
+ <tr>
288
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
289
+ <td>8B</td>
290
+ <td>784</td>
291
+ <td>67.1</td>
292
+ <td><u>866</u></td>
293
+ <td>58.2</td>
294
+ <td>83.0</td>
295
+ <td>62.0</td>
296
+ <td>60.7</td>
297
+ <td>2326.0</td>
298
+ <td>81.8</td>
299
+ <td>83.0</td>
300
+ <td>54.1</td>
301
+ <td>50.6</td>
302
+ <td><strong>84.3</strong></td>
303
+ <td><u>94.5</u></td>
304
+ <td>31.9</td>
305
+ <td>16.3</td>
306
+ <td>3.2</td>
307
+ </tr>
308
+ <tr>
309
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
310
+ <td>72B</td>
311
+ <td>182</td>
312
+ <td>68.1</td>
313
+ <td>741</td>
314
+ <td>67.5</td>
315
+ <td>83.7</td>
316
+ <td>60.6</td>
317
+ <td><strong>65.8</strong></td>
318
+ <td>2261.0</td>
319
+ <td><strong>85.0</strong></td>
320
+ <td><u>85.6</u></td>
321
+ <td>56.8</td>
322
+ <td>49.0</td>
323
+ <td>80.5</td>
324
+ <td>91.3</td>
325
+ <td>39.1</td>
326
+ <td>-</td>
327
+ <td>3.5</td>
328
+ </tr>
329
+ <tr>
330
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
331
+ <td>8B</td>
332
+ <td>706</td>
333
+ <td>68.3</td>
334
+ <td>822</td>
335
+ <td><u>64.4</u></td>
336
+ <td>84.8</td>
337
+ <td>62.8</td>
338
+ <td>62.8</td>
339
+ <td>2344.0</td>
340
+ <td><u>83.6</u></td>
341
+ <td>84.5</td>
342
+ <td>56.0</td>
343
+ <td>50.1</td>
344
+ <td>79.1</td>
345
+ <td>93.0</td>
346
+ <td>39.5</td>
347
+ <td>19.7</td>
348
+ <td>3.4</td>
349
+ </tr>
350
+ <tr>
351
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
352
+ <td>8B</td>
353
+ <td><strong>2822</strong></td>
354
+ <td>65.2</td>
355
+ <td>852*</td>
356
+ <td>60.6</td>
357
+ <td>79.4</td>
358
+ <td>60.0</td>
359
+ <td>57.5</td>
360
+ <td><u>2348.4*</u></td>
361
+ <td>78.0</td>
362
+ <td>82.1</td>
363
+ <td>49.8*</td>
364
+ <td>48.1*</td>
365
+ <td>80.1</td>
366
+ <td>90.8</td>
367
+ <td>25.7</td>
368
+ <td>18.3</td>
369
+ <td>3.6</td>
370
+ </tr>
371
+ <tr>
372
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
373
+ <td>8B</td>
374
+ <td><strong>2822</strong></td>
375
+ <td><strong>70.2</strong></td>
376
+ <td><strong>897*</strong></td>
377
+ <td><strong>71.9*</strong></td>
378
+ <td><u>86.9*</u></td>
379
+ <td><u>67.5</u></td>
380
+ <td><u>64.0</u></td>
381
+ <td><strong>2372.0*</strong></td>
382
+ <td>80.5</td>
383
+ <td><strong>85.8</strong></td>
384
+ <td>50.4*</td>
385
+ <td><u>51.9</u></td>
386
+ <td>82.0</td>
387
+ <td>93.5</td>
388
+ <td><u>41.4*</u></td>
389
+ <td><u>23.1*</u></td>
390
+ <td><strong>3.8</strong></td>
391
+ </tr>
392
+ </tbody>
393
+ </table>
394
+ </div>
395
+ * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
396
+
397
+
398
+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
399
+
400
+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
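The token density definition above can be checked against the table using the card's own numbers: a maximum-resolution 1344x1344 input encoded into the 640 visual tokens mentioned earlier gives:

```python
# Token density = pixels at maximum resolution / number of visual tokens.
pixels = 1344 * 1344                  # ~1.8M pixels, the maximum resolution above
visual_tokens = 640                   # tokens produced for such an image
print(round(pixels / visual_tokens))  # 2822, the MiniCPM-o 2.6 value in the table
```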
401
+
402
+
403
+ **Multi-image and Video Understanding:**
404
+
405
+ <details>
406
+ <summary>click to view</summary>
407
+ <div align="center">
408
+
409
+ <table style="margin: 0px auto;">
410
+ <thead>
411
+ <tr>
412
+ <th align="left">Model</th>
413
+ <th>Size</th>
414
+ <th>BLINK val</th>
415
+ <th>Mantis Eval</th>
416
+ <th>MIRB</th>
417
+ <th>Video-MME (wo / w subs)</th>
418
+ </tr>
419
+ </thead>
420
+ <tbody align="center">
421
+ <tr>
422
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
423
+ </tr>
424
+ <tr>
425
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
426
+ <td>-</td>
427
+ <td><strong>68.0</strong></td>
428
+ <td>-</td>
429
+ <td>-</td>
430
+ <td><strong>71.9/77.2</strong></td>
431
+ </tr>
432
+ <tr>
433
+ <td nowrap="nowrap" align="left">GPT4V</td>
434
+ <td>-</td>
435
+ <td>54.6</td>
436
+ <td>62.7</td>
437
+ <td>53.1</td>
438
+ <td>59.9/63.3</td>
439
+ </tr>
440
+ <tr>
441
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
442
+ </tr>
443
+ <tr>
444
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
445
+ <td>14B</td>
446
+ <td>52.6</td>
447
+ <td>66.4</td>
448
+ <td>30.2</td>
449
+ <td>-</td>
450
+ </tr>
451
+ <tr>
452
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
453
+ <td>72B</td>
454
+ <td>55.4</td>
455
+ <td><strong>77.6</strong></td>
456
+ <td>-</td>
457
+ <td><u>66.2/69.5</u></td>
458
+ </tr>
459
+ <tr>
460
+ <td nowrap="nowrap" align="left">MANTIS 8B</td>
461
+ <td>8B</td>
462
+ <td>49.1</td>
463
+ <td>59.5</td>
464
+ <td>34.8</td>
465
+ <td>-</td>
466
+ </tr>
467
+ <tr>
468
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
469
+ <td>8B</td>
470
+ <td>53.2</td>
471
+ <td>69.6*</td>
472
+ <td><strong>67.6*</strong></td>
473
+ <td>63.3/69.0</td>
474
+ </tr>
475
+ <tr>
476
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
477
+ <td>8B</td>
478
+ <td>54.8</td>
479
+ <td>67.7</td>
480
+ <td>52.5</td>
481
+ <td>64.2/66.9</td>
482
+ </tr>
483
+ <tr>
484
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
485
+ <td>8B</td>
486
+ <td>53.0</td>
487
+ <td>69.1</td>
488
+ <td>53.8</td>
489
+ <td>60.9/63.6</td>
490
+ </tr>
491
+ <tr>
492
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
493
+ <td>8B</td>
494
+ <td><u>56.7</u></td>
495
+ <td><u>71.9</u></td>
496
+ <td><u>58.6</u></td>
497
+ <td>63.9/67.9</td>
498
+ </tr>
499
+ </tbody>
500
+ </table>
501
+
502
+ </div>
503
+ * We evaluate officially released checkpoints by ourselves.
504
+
505
+ </details>
506
+
507
+
508
+ #### Audio understanding and speech conversation results
509
 
510
+ **Audio Understanding:**
511
 
512
+ <div align="center">
513
+ <table style="margin: 0px auto;">
514
+ <thead>
515
+ <tr>
516
+ <th align="left">Task</th>
517
+ <th>Size</th>
518
+ <th colspan="3">ASR (zh)</th>
519
+ <th colspan="3">ASR (en)</th>
520
+ <th colspan="2">AST</th>
521
+ <th>Emotion</th>
522
+ </tr>
523
+ <tr>
524
+ <th align="left">Metric</th>
525
+ <td></td>
526
+ <th colspan="3">CER↓</th>
527
+ <th colspan="3">WER↓</th>
528
+ <th colspan="2">BLEU↑</th>
529
+ <th>ACC↑</th>
530
+ </tr>
531
+ <tr>
532
+ <th align="left">Dataset</th>
533
+ <td></td>
534
+ <th>AISHELL-1</th>
535
+ <th>Fleurs zh</th>
536
+ <th>WenetSpeech test-net</th>
537
+ <th>LibriSpeech test-clean</th>
538
+ <th>GigaSpeech</th>
539
+ <th>TED-LIUM</th>
540
+ <th>CoVoST en2zh</th>
541
+ <th>CoVoST zh2en</th>
542
+ <th>MELD emotion</th>
543
+ </tr>
544
+ </thead>
545
+ <tbody align="center">
546
+ <tr>
547
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
548
+ </tr>
549
+ <tr>
550
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
551
+ <td>-</td>
552
+ <td>7.3*</td>
553
+ <td><u>5.4*</u></td>
554
+ <td>28.9*</td>
555
+ <td>2.6*</td>
556
+ <td>12.9*</td>
557
+ <td>4.8*</td>
558
+ <td>37.1*</td>
559
+ <td>15.7*</td>
560
+ <td>33.2*</td>
561
+ </tr>
562
+ <tr>
563
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
564
+ <td>-</td>
565
+ <td>4.5*</td>
566
+ <td>5.9*</td>
567
+ <td>14.3*</td>
568
+ <td>2.9*</td>
569
+ <td>10.6*</td>
570
+ <td><strong>3.0*</strong></td>
571
+ <td><u>47.3*</u></td>
572
+ <td>22.6*</td>
573
+ <td>48.4*</td>
574
+ </tr>
575
+ <tr>
576
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
577
+ </tr>
578
+ <tr>
579
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
580
+ <td>8B</td>
581
+ <td>-</td>
582
+ <td>7.5</td>
583
+ <td>-</td>
584
+ <td><strong>1.6</strong></td>
585
+ <td>-</td>
586
+ <td>-</td>
587
+ <td>45.2</td>
588
+ <td><u>24.4</u></td>
589
+ <td><strong>55.3</strong></td>
590
+ </tr>
591
+ <tr>
592
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
593
+ <td>8B</td>
594
+ <td>2.6*</td>
595
+ <td>6.9*</td>
596
+ <td><u>10.3*</u></td>
597
+ <td>3.1*</td>
598
+ <td><u>9.7</u>*</td>
599
+ <td>5.9*</td>
600
+ <td>39.5*</td>
601
+ <td>22.9*</td>
602
+ <td>17.4*</td>
603
+ </tr>
604
+ <tr>
605
+ <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
606
+ <td>9B</td>
607
+ <td><u>2.5</u></td>
608
+ <td>-</td>
609
+ <td>-</td>
610
+ <td>2.8</td>
611
+ <td>-</td>
612
+ <td>-</td>
613
+ <td>-</td>
614
+ <td>-</td>
615
+ <td>-</td>
+ </tr>
616
+ <tr>
617
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
618
+ <td>8B</td>
619
+ <td><strong>1.6</strong></td>
620
+ <td><strong>4.4</strong></td>
621
+ <td><strong>6.9</strong></td>
622
+ <td><u>1.7</u></td>
623
+ <td><strong>8.7</strong></td>
624
+ <td><strong>3.0</strong></td>
625
+ <td><strong>48.2</strong></td>
626
+ <td><strong>27.2</strong></td>
627
+ <td><u>52.4</u></td>
628
+ </tr>
629
+ </tbody>
630
+ </table>
631
+ </div>
632
+ * We evaluate officially released checkpoints by ourselves.<br><br>
633
 
634
+ **Speech Generation:**
 
 
 
 
635
 
636
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th>
<th></th>
<th colspan="3">ACC↑</th>
<th>G-Eval (10 point)↑</th>
<th>Semantic ELO score↑</th>
<th>Acoustic ELO score↑</th>
<th>Overall ELO score↑</th>
<th>UTMOS↑</th>
<th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th></th>
<th>Speech Llama Q.</th>
<th>Speech Web Q.</th>
<th>Speech Trivia QA</th>
<th>Speech AlpacaEval</th>
<th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td></td>
<td><strong>71.7</strong></td>
<td><strong>51.6</strong></td>
<td><strong>69.7</strong></td>
<td><strong>7.4</strong></td>
<td><strong>1157</strong></td>
<td><strong>1203</strong></td>
<td><strong>1200</strong></td>
<td><strong>4.2</strong></td>
<td><strong>2.3</strong></td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
<td>9B</td>
<td>50.0</td>
<td>32.0</td>
<td>36.4</td>
<td><u>5.1</u></td>
<td>999</td>
<td>1147</td>
<td>1035</td>
<td><u>4.1</u></td>
<td><u>11.7</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Llama-Omni</td>
<td>8B</td>
<td>45.3</td>
<td>22.9</td>
<td>10.7</td>
<td>3.9</td>
<td>960</td>
<td>878</td>
<td>897</td>
<td>3.2</td>
<td>24.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Moshi</td>
<td>7B</td>
<td>43.7</td>
<td>23.8</td>
<td>16.7</td>
<td>2.4</td>
<td>871</td>
<td>808</td>
<td>875</td>
<td>2.8</td>
<td>8.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Omni</td>
<td>1B</td>
<td>22.0</td>
<td>12.8</td>
<td>6.9</td>
<td>2.5</td>
<td>926</td>
<td>803</td>
<td>865</td>
<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
<td><u>40.0</u></td>
<td><u>40.2</u></td>
<td><u>5.1</u></td>
<td><u>1088</u></td>
<td><u>1163</u></td>
<td><u>1131</u></td>
<td><strong>4.2</strong></td>
<td>9.8</td>
</tr>
</tbody>
</table>
</div>
All results are from AudioEvals; the evaluation methods and further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">UltraEval-Audio</a>.<br><br>

**End-to-end Voice Cloning**

<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th colspan="2">Voice cloning</th>
</tr>
<tr>
<th align="left">Metric</th>
<th>SIMO↑</th>
<th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th>Seed-TTS test-zh</th>
<th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">F5-TTS</td>
<td><strong>76</strong></td>
<td><strong>67</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CosyVoice</td>
<td><u>75</u></td>
<td><u>64</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">FireRedTTS</td>
<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
</tr>
</tbody>
</table>
</div>

#### Multimodal Live Streaming

Results on StreamingBench:

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Real-Time Video Understanding</th>
<th>Omni-Source Understanding</th>
<th>Contextual Understanding</th>
<th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td><u>77.4</u></td>
<td><strong>67.8</strong></td>
<td><strong>51.1</strong></td>
<td><strong>70.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
<td>-</td>
<td>74.5</td>
<td>51.0</td>
<td><u>48.0</u></td>
<td>64.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
<td>-</td>
<td>74.0</td>
<td>41.4</td>
<td>37.8</td>
<td>59.7</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA-1.5</td>
<td>8B</td>
<td>61.5</td>
<td>37.5</td>
<td>26.7</td>
<td>49.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>63.1</td>
<td>35.9</td>
<td>30.2</td>
<td>50.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
<td>34B</td>
<td>69.8</td>
<td>41.7</td>
<td>34.3</td>
<td>56.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>71.2</td>
<td>40.7</td>
<td>33.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>70.1</td>
<td>42.7</td>
<td>34.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>70.9</td>
<td>40.8</td>
<td>35.8</td>
<td>57.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
<td>8B</td>
<td>74.3</td>
<td>40.8</td>
<td>31.0</td>
<td>58.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
<td>8B</td>
<td>75.4</td>
<td>46.2</td>
<td>33.6</td>
<td>60.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>72.4</td>
<td>40.2</td>
<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
<td><u>53.4</u></td>
<td>38.5</td>
<td><u>66.0</u></td>
</tr>
</tbody>
</table>

### Examples <!-- omit in toc -->

We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording on an iPad Pro together with a web demo.

<div align="center">
  <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width="70%"></a>
</div>

<br>

<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>

## Online Demo
Try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn).

## Usage
Inference uses Hugging Face Transformers on NVIDIA GPUs. Please make sure `transformers==4.44.2` is installed, as other versions may have compatibility issues; we are investigating this. Requirements tested on Python 3.10:
```
Pillow==10.1.0
torch==2.3.1
torchaudio==2.3.1
torchvision==0.18.1
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy
```

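The pinned versions above can be installed in one step. This is a minimal setup sketch (the virtual-environment name is arbitrary, and the package list simply mirrors the requirements above):

```shell
# Optional: create an isolated environment
python -m venv .venv && source .venv/bin/activate

# Install the pinned dependencies listed above
pip install Pillow==10.1.0 torch==2.3.1 torchaudio==2.3.1 torchvision==0.18.1 \
    transformers==4.44.2 librosa==0.9.0 soundfile==0.12.1 \
    vector-quantize-pytorch==1.18.5 vocos==0.1.0 decord moviepy
```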
### Model initialization
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the omni model; init_vision/init_audio/init_tts all default to True.
# To load a vision-only model, set init_audio=False and init_tts=False.
# To load an audio-only model, set init_vision=False.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',  # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Except in vision-only mode, the TTS processor and vocos vocoder also need to be initialized
model.init_tts()
```

If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, convert the TTS module to float32:
```python
model.tts.float()
```

### Omni mode
We provide two inference modes: chat and streaming.

#### Chat inference
```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # one frame + 1 s audio chunk per unit
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i + 1)
        image = Image.fromarray(frame.astype(np.uint8))
        audio = audio_np[sr * i : sr * (i + 1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents

video_path = "assets/Skiing.mp4"
# to use a voice-clone prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
# or use the default prompt
# sys_msg = model.get_sys_prompt(mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,  # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)

## You will get an answer like: The person in the picture is skiing down a snowy slope.
# import IPython
# IPython.display.Audio('output.wav')
```
#### Streaming inference
```python
# A new conversation needs a session reset first; this clears the KV cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill the system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt

    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```
 
### Speech and Audio Mode

Model initialization

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

<hr/>

#### Mimick

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True)  # load the audio to be mimicked

# You can also try `./assets/input_examples/cxk_original.wav`,
# `./assets/input_examples/fast-pace.wav`,
# `./assets/input_examples/chi-english-1.wav`, and
# `./assets/input_examples/exciting-emotion.wav`
# for different aspects of speech-centric features.

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav',  # save the TTS result to output_audio_path
)
```

<hr/>
#### General Speech Conversation with Configurable Voices

A common usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# round two: note that list.append mutates in place and returns None,
# so extend the history directly instead of assigning append's result
msgs.append({'role': 'assistant', 'content': res})
msgs.append({'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]})
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```

<hr/>
#### Speech Conversation as an AI Assistant

An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, but only with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, and it follows instructions more closely. For demos, we suggest using `assistant_female_voice`, `assistant_male_voice`, or `assistant_default_female_voice`. Other voices may work, but not as stably as the default voices.

*Please note that `assistant_female_voice` and `assistant_male_voice` are more stable but sound robotic, while `assistant_default_female_voice` is more human-like but less stable; its voice often changes over multiple turns. We suggest trying the stable voices `assistant_female_voice` and `assistant_male_voice`.*

```python
ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True)  # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # load the user's audio question

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# round two: list.append returns None, so build the history in place
msgs.append({'role': 'assistant', 'content': res})
msgs.append({'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]})
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)
```

<hr/>
#### Instruction-to-Speech

`MiniCPM-o-2.6` can also do instruction-to-speech, a.k.a. **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more instruction-to-speech sample instructions, see https://voxinstruct.github.io/VoxInstruct/.

```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
```

<hr/>
 
#### Voice Cloning

`MiniCPM-o-2.6` can also do zero-shot text-to-speech, a.k.a. **Voice Cloning**. In this mode, the model acts as a TTS model.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)
```

<hr/>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
#### Addressing Various Audio Understanding Tasks

`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。` ("Please listen carefully to this audio clip and transcribe its content verbatim.")
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"  # can be changed to the other prompts above
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True)  # load the audio to be transcribed

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
```

### Vision-Only mode

`MiniCPM-o-2_6` uses the same inference methods as `MiniCPM-V-2_6`.

#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## For streaming output, make sure sampling=True and stream=True;
## model.chat will then return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```

#### Chat with multiple images
<details>
<summary> Click to show Python code running MiniCPM-o 2.6 with multiple images input. </summary>

```python
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, and tell me about the differences between them.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### In-context few-shot learning
<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary>

```python
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### Chat with video
<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary>

```python
from decord import VideoReader, cpu  # decord is listed in the requirements above

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>
1462
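As a quick sanity check of the frame-sampling logic above, the inner `uniform_sample` helper picks `n` roughly evenly spaced elements by stepping through the list in gaps of `len(l) / n` and taking the midpoint of each gap. A standalone sketch of the same helper:

```python
def uniform_sample(l, n):
    # choose n indices evenly spread over l, taking the middle of each gap
    gap = len(l) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [l[i] for i in idxs]

print(uniform_sample(list(range(10)), 5))          # [1, 3, 5, 7, 9]
print(len(uniform_sample(list(range(1000)), 64)))  # 64
```

This is why the output of `encode_video` is capped at `MAX_NUM_FRAMES` frames while still covering the whole video.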

Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more details about usage.


## Inference with llama.cpp<a id="llamacpp"></a>
MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and the [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more details.


## Int4 quantized version
Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).


## License
#### Model License
* The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
* To help us better understand and support our users, we would deeply appreciate it if you could optionally fill out a brief registration [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).


#### Statement
* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-o/V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or abuse of the model.

## Key Techniques and Other Multimodal Projects

👏 Welcome to explore the key techniques of MiniCPM-o 2.6 and other multimodal projects of our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5af168eabd5d0954ff83073fe3b36fc7fd6a8c9e8b6591132313241d8165d91c
-size 69676335
+oid sha256:7955f2e2d47a7db417469976a67281e11a004cc9fc86b21de5ae626a1f2d11a8
+size 34895535