As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This **allows MiniCPM-o 4.5 to see, listen, and speak simultaneously**, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform **proactive interaction**, such as initiating reminders or comments based on its continuous understanding of the live scene.

- 💪 **Strong OCR Capability, Efficiency and Others.**

  Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 4.5 efficiently processes **high-resolution images** (up to 1.8 million pixels) and **high-FPS videos** (up to 10 fps) in any aspect ratio. It achieves **state-of-the-art performance for end-to-end English document parsing** on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features **trustworthy behaviors**, matching Gemini 2.5 Flash on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.

- 💫 **Easy Usage.**

  MiniCPM-o 4.5 can be easily used in various ways. **Basic usage (recommended for 100% precision):** PyTorch inference with an Nvidia GPU. **Other adaptations** include (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4- and GGUF-format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput, memory-efficient inference, and (4) FlagOS support as a unified multi-chip backend plugin. **We have also open-sourced web demos** that **enable the full-duplex multimodal live streaming experience on local devices** such as GPU servers and PCs (e.g., a MacBook).

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** The modality encoders/decoders and the LLM are densely connected via hidden states in an end-to-end fashion. This enables better information flow and control, and facilitates full exploitation of rich multimodal knowledge during training.
- **Full-Duplex Omni-modal Live Streaming Mechanism.** (1) We turn the offline modality encoders/decoders into online, full-duplex ones for streaming inputs/outputs. The speech token decoder models text and speech tokens in an interleaved fashion to support full-duplex speech generation (i.e., syncing promptly with new input). This also enables more stable long speech generation (e.g., >1 min).

  (2) **We sync all input and output streams on a millisecond-level timeline**, jointly modeled by a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. TDM divides the parallel omni-modality streams into sequential information groups within small periodic time slices (see the sketch after this list).
- **Proactive Interaction Mechanism.** The LLM continuously monitors the input video and audio streams and decides, at a frequency of 1 Hz, whether to speak. This high decision-making frequency, together with the full-duplex design, is crucial for the proactive interaction capability.
- **Configurable Speech Modeling Design.** We inherit the multimodal system prompt design of MiniCPM-o 2.6, which includes a traditional text system prompt and a new audio system prompt that determines the assistant's voice. This enables voice cloning and role-play at inference time for speech conversation.
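
To make the TDM idea concrete, here is a minimal illustrative sketch (not the actual implementation; the stream representations, slice length, and token values are assumptions of this sketch) of how parallel modality streams can be divided into sequential per-slice groups:

```python
from typing import Dict, List, Tuple

def tdm_interleave(
    streams: Dict[str, List[Tuple[int, str]]],  # modality -> [(timestamp_ms, token), ...]
    slice_ms: int = 100,                         # assumed time-slice length
) -> List[Tuple[str, List[str]]]:
    """Divide parallel modality streams into sequential info groups
    per periodic time slice, yielding one interleaved sequence."""
    horizon = max(t for tokens in streams.values() for t, _ in tokens)
    sequence = []
    for start in range(0, horizon + slice_ms, slice_ms):
        # Within a slice, group tokens modality by modality.
        for modality, tokens in streams.items():
            group = [tok for t, tok in tokens if start <= t < start + slice_ms]
            if group:
                sequence.append((modality, group))
    return sequence

# Example: video frames at 10 fps, audio tokens every 25 ms.
streams = {
    "video": [(i * 100, f"v{i}") for i in range(5)],
    "audio": [(i * 25, f"a{i}") for i in range(20)],
}
print(tdm_interleave(streams))
```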

<div align="center">
<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-o-45-framework.png" width=100%>
</div>

<div align="center">
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm_o_45_main_exp_table.png" width=90%>
</div>

<strong>Note</strong>: Scores marked with ∗ are from our evaluation; others are cited from referenced reports. n/a indicates that the model does not support the corresponding modality. All results are reported in the instruct mode/variant.

<br>

<details>
<summary>Click to view visual understanding results.</summary>
</details>

<details>
<summary>Click to view omni half-duplex results.</summary>

**Omni Half-Duplex**

</details>

<div align="center">
<a href="https://www.youtube.com/watch?v=6UzC-O1Q-1U"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo4_5/video_play.png" width=70%></a>
</div>

### Examples: Omnimodal Full-Duplex Conversation <!-- omit in toc -->

> [!NOTE]
> For detailed omnimodal full-duplex conversation examples, refer to the [Omni Full-Duplex Casebook](https://openbmb.github.io/minicpm-o-4_5-omni/)

### Examples: 🎙️ Speech Conversation <!-- omit in toc -->

> [!NOTE]
> For detailed speech conversation examples, refer to the [Audio Demo Page](https://openbmb.github.io/minicpm-o-4_5/)

Half-duplex speech conversation with custom reference audio and character prompts.

<details open>
<summary>🚀 <b>Elon Musk</b> - Voice Roleplay (EN)</summary>

</details>

## Offline Inference Examples with Transformers

Inference using Hugging Face Transformers on NVIDIA GPUs. Please ensure `transformers==4.51.0` is installed, as other versions may have compatibility issues (under investigation). Requirements are tested on Python 3.10.
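
A minimal loading sketch (the checkpoint id `openbmb/MiniCPM-o-4_5`, the tokenizer usage, and the dtype below are illustrative assumptions, not confirmed by this README):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "openbmb/MiniCPM-o-4_5"  # assumed checkpoint id; adjust to the actual repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,      # MiniCPM-o ships custom modeling code with the checkpoint
    torch_dtype=torch.bfloat16,  # assumed dtype for GPU inference
)
model.eval().cuda()
```

With the model loaded, the snippet below initializes TTS and switches between half-duplex and duplex modes: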
```python
# Initialize TTS for audio output
model.init_tts()

# Convert half-duplex model to duplex mode
duplex_model = model.as_duplex()

# Convert duplex model back to half-duplex mode
model = duplex_model.as_simplex(reset_session=True)
```
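
`as_duplex()` and `as_simplex()` let the same loaded weights serve both the full-duplex live streaming mode and the turn-based half-duplex modes described below; judging by its name, `reset_session=True` presumably clears any in-flight streaming session state when switching back.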

### Half-Duplex Omni Mode <!-- omit in toc -->

We provide two inference modes: chat and streaming.

#### Chat Inference <!-- omit in toc -->
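
A minimal chat-mode sketch (the mixed image-plus-audio content list, the file paths, and the sampling rate are assumptions of this sketch; the `msgs` structure and the `model.chat` call mirror the image example later in this README):

```python
import librosa
from PIL import Image

# One user turn may mix modalities: images, audio arrays, and text.
image = Image.open("example.jpg").convert("RGB")              # assumed local file
audio, _ = librosa.load("example.wav", sr=16000, mono=True)   # assumed 16 kHz mono input

msgs = [{"role": "user", "content": [image, audio, "Describe what you see and hear."]}]

res = model.chat(msgs=msgs, use_tts_template=False)  # model loaded as in the sketch above
print(res)
```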

### Half-Duplex Realtime Speech Conversation Mode <!-- omit in toc -->

<details>
<summary>Click to show half-duplex mode realtime speech conversation API usage.</summary>

First, make sure you have all the dependencies, especially `"minicpmo-utils[all]>=1.0.5"`:

```bash
pip install "minicpmo-utils[all]>=1.0.5"  # assumed install command for the stated requirement
```

</details>

#### Speech Conversation as a Versatile and Vibe AI Assistant <!-- omit in toc -->

<details>
<summary>Click to show AI assistant conversation code.</summary>

Built on carefully designed post-training data and professional voice-actor recordings, `MiniCPM-o-4.5` can also function as an AI voice assistant. It delivers high-quality spoken interaction out of the box. It produces a sweet and expressive voice with natural prosody, including appropriate rhythm, stress, and pauses, giving a strong sense of liveliness in casual conversation. It also supports storytelling and narrative speech with coherent and engaging delivery. Moreover, it enables advanced voice instruction control, such as emotional tone and word-level emphasis.

```python
import librosa
# ...
```

</details>

#### General Speech Conversation with Custom Voice and Custom System Profile <!-- omit in toc -->

<details>
<summary>Click to show custom voice conversation code.</summary>

MiniCPM-o-4.5 can role-play as a specific character based on an audio prompt and a text profile prompt. It mimics the character's voice and adopts their language style in text responses, and it follows the profile defined in the text prompt. In this mode, MiniCPM-o-4.5 sounds **more natural and human-like**.

```python
import librosa
# ...
```

</details>

#### Zero-shot Text-to-speech (TTS) <!-- omit in toc -->

<details>
<summary>Click to show TTS code.</summary>

`MiniCPM-o-4.5` supports zero-shot text-to-speech (TTS). In this mode, the model functions as a highly natural TTS system that can replicate a reference voice.

```python
import librosa
# ...
```

</details>
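
A zero-shot TTS sketch: the `sys_msg` layout with a reference waveform and the `generate_audio`/`output_audio_path` arguments below are assumptions of this sketch, not the confirmed API:

```python
import librosa

# Reference voice to replicate (the path and 16 kHz rate are assumptions).
ref_audio, _ = librosa.load("assets/ref_voice.wav", sr=16000, mono=True)

# Assumed system-message layout: a text instruction plus the reference waveform.
sys_msg = {"role": "system", "content": ["Use the voice in the reference audio to speak.", ref_audio]}
msgs = [sys_msg, {"role": "user", "content": ["Please read aloud: The quick brown fox jumps over the lazy dog."]}]

res = model.chat(
    msgs=msgs,
    use_tts_template=True,       # enable the TTS-style template
    generate_audio=True,         # assumed flag for returning synthesized speech
    output_audio_path="out.wav", # assumed argument for saving the waveform
)
```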

#### Mimick <!-- omit in toc -->

<details>
<summary>Click to show mimick code.</summary>

The `Mimick` task evaluates a model's end-to-end speech modeling capability. The model takes audio input, transcribes it, and reconstructs the original audio with high fidelity, preserving detailed acoustic, paralinguistic, and semantic information. Higher similarity between the reconstructed and original audio indicates stronger end-to-end speech modeling capability.

```python
import librosa
# ...
```

</details>

#### Addressing Various Audio Understanding Tasks <!-- omit in toc -->

<details>
<summary>Click to show audio understanding code.</summary>

`MiniCPM-o-4.5` can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

- General Audio Caption: `Summarize the main content of the audio.`
- Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`
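
A minimal audio-to-text sketch using the prompts above (the audio path and the 16 kHz sampling rate are assumptions; the `msgs` structure mirrors the image example below):

```python
import librosa

# Load a 16 kHz mono waveform (the sampling rate is an assumption of this sketch).
audio, _ = librosa.load("assets/example_audio.wav", sr=16000, mono=True)

task_prompt = "Summarize the main content of the audio."  # General Audio Caption
msgs = [{"role": "user", "content": [audio, task_prompt]}]

res = model.chat(msgs=msgs, use_tts_template=False)  # text-only output for audio-to-text tasks
print(res)
```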

```python
from PIL import Image

image = Image.open("assets/fossil.png").convert("RGB")
question = "What is in the image?"
msgs = [{"role": "user", "content": [image, question]}]

res = model.chat(msgs=msgs, use_tts_template=False)
print(res)
```

</details>

## Deploy a Realtime Web Demo on Your Own Device

### Option A (Recommended): **PyTorch Inference with Nvidia GPU** for 100% model precision with no loss in performance.

We provide a PyTorch-based [simplified yet fully functional web demo](https://github.com/OpenBMB/minicpm-o-4_5-pytorch-simple-demo) that boosts model inference performance and supports:

- full-duplex omnimodal live streaming
- full-duplex speech live streaming
- half-duplex speech live streaming (under development)
- turn-based chat conversation
- customizable system prompts
- customizable reference audio
- a simple and readable codebase for continual development
- serving as an API backend for third-party applications

Requirements:

- Nvidia GPU with at least 28GB of GPU memory. *We are working on optimizing the model for lower GPU memory usage.*

### Option B: **llama.cpp-omni** for end-side inference on PCs (e.g., Mac) and low-resource devices.

With a fully C++ implementation of `MiniCPM-o 4.5` and quantized weights, `llama.cpp-omni` supports:

- half-duplex speech realtime conversation
- full-duplex omnimodal live streaming

We provide [ready-to-run guidance](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/WebRTC_Demo/README.md) for accessing low-latency full-duplex communication directly on your own Mac using our new official Docker image.

Requirements:

- For half-duplex speech realtime conversation: an Apple M3/M4/M5 chip with at least 16GB of RAM, or a low-resource Nvidia GPU with at least 12GB of GPU memory
- For full-duplex omnimodal live streaming: an Apple M4 Max chip with at least 24GB of RAM, or a low-resource Nvidia GPU with at least 12GB of GPU memory

## FlagOS

<details>
<summary>Click to show FlagOS Usage details.</summary>

</details>

### vLLM, SGLang, llama.cpp, Ollama

We support inference with vLLM, SGLang, llama.cpp, and Ollama. Refer to our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-Cookbook) for more details.

### LLaMA-Factory, SWIFT

We support fine-tuning with LLaMA-Factory and SWIFT. Refer to our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-Cookbook) for more details.

## MiniCPM-V & o Cookbook

Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:

**Easy Usage Documentation**

We support a wide range of users, from individuals to enterprises and researchers.

* **Individuals**: Enjoy effortless inference using Ollama ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-o4_5_ollama.md)) and llama.cpp ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-o4_5_llamacpp.md)) with minimal setup.
* **Enterprises**: Achieve high-throughput, scalable performance with vLLM ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-o4_5_vllm.md)) and SGLang ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-o4_5_sglang.md)).
* **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.

**Versatile Deployment Scenarios**

* The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.

#### Statement

* As MLLMs, MiniCPM-o/V models generate content by learning from a large number of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o/V models does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.

## Key Techniques and Other Multimodal Projects <!-- omit in toc -->