bokesyo committed on
Commit
ab2d5c7
·
1 Parent(s): ed428df

Update readme from github

Files changed (1)
  1. README.md +83 -32
README.md CHANGED
@@ -35,12 +35,19 @@ A Gemini 2.5 Flash Level MLLM for Vision, Speech, and Full-Duplex Multimodal Liv
35
  As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This **allows MiniCPM-o 4.5 to see, listen, and speak simultaneously**, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform **proactive interaction**, such as initiating reminders or comments based on its continuous understanding of the live scene.
36
 
37
  - 💪 **Strong OCR Capability, Efficiency and Others.**
38
- Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 4.5 can process **high-resolution images** (up to 1.8 million pixels) and **high-FPS videos** (up to 10fps) in any aspect ratio efficiently. It achieves **state-of-the-art peformance for end-to-end English document parsing** on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features **trustworthy behaviors**, matching Gemini 2.5 Flash on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages.
39
 
40
  - 💫 **Easy Usage.**
41
- MiniCPM-o 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-o4_5_llamacpp.md) and [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-o4_5_ollama.md) support for efficient CPU inference on local devices, (2) [int4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-o4_5_awq_quantize.md) and [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-o4_5_gguf_quantize.md) format quantized models in 16 sizes, (3) [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-o4_5_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-o4_5_sglang.md) support for high-throughput and memory-efficient inference, (4) [FlagOS](#flagos) support for the unified multi-chip backend plugin, (5) fine-tuning on new domains and tasks with [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/llama-factory/finetune_llamafactory.md), and (6) online web demo on [server](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/gradio/README_o45.md). We also rollout a high-performing [llama.cpp-omni](https://github.com/tc-mb/llama.cpp-omni) inference framework together with a [WebRTC Demo](https://minicpm-omni.openbmb.cn/), which **enables the full-duplex multimodal live streaming experience on local devices** such as [PCs](https://github.com/tc-mb/llama.cpp-omni/blob/master/README.md) (e.g., on a MacBook).
42
 
43
  **Model Architecture.**
44
 
45
  <div align="center">
46
  <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-o-45-framework.png" width=100%>
@@ -60,7 +67,10 @@ Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 4.5 can p
60
  <div align="center">
61
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm_o_45_main_exp_table.png" width=90%>
62
  </div>
63
- Note: Scores marked with ∗ are from our evaluation; others are cited from referenced reports. n/a indicates that the model does not support the corresponding modality. All results are reported in instruct mode/variant.
64
 
65
  <details>
66
  <summary>Click to view visual understanding results.</summary>
@@ -653,9 +663,9 @@ Note: Scores marked with ∗ are from our evaluation; others are cited from refe
653
  </details>
654
 
655
  <details>
656
- <summary>Click to view omni simplex results.</summary>
657
 
658
- **Omni Simplex**
659
  <div align="center">
660
  <table style="margin: 0px auto;">
661
  <tr>
@@ -962,12 +972,17 @@ Note: Scores marked with ∗ are from our evaluation; others are cited from refe
962
  <a href="https://www.youtube.com/watch?v=6UzC-O1Q-1U"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo4_5/video_play.png" width=70%></a>
963
  </div>
964
 
965
  ### Examples: 🎙️ Speech Conversation <!-- omit in toc -->
966
 
967
  > [!NOTE]
968
  > For detailed speech conversation examples, refer to [Audio Demo Page](https://openbmb.github.io/minicpm-o-4_5/)
969
 
970
- Simplex speech conversation with custom reference audio and character prompts.
971
 
972
  <details open>
973
  <summary>🚀 <b>Elon Musk</b> - Voice Roleplay (EN)</summary>
@@ -997,7 +1012,7 @@ Simplex speech conversation with custom reference audio and character prompts.
997
  </div>
998
 
999
 
1000
- ## Usage
1001
 
1002
  Inference using Hugging Face Transformers on NVIDIA GPUs. Please ensure `transformers==4.51.0` is installed, as other versions may have compatibility issues (under investigation). Requirements tested on Python 3.10:
1003
 
@@ -1057,11 +1072,11 @@ model.eval().cuda()
1057
  # Initialize TTS for audio output
1058
  model.init_tts()
1059
 
1060
- # Convert simplex model to duplex mode
1061
  duplex_model = model.as_duplex()
1062
 
1063
- # Convert duplex model back to simplex mode
1064
- simplex_model = duplex_model.as_simplex(reset_session=True)
1065
  ```
1066
 
1067
 
@@ -1158,7 +1173,7 @@ generate_duplex_video(
1158
  ```
1159
 
1160
 
1161
- ### Simplex Omni Mode <!-- omit in toc -->
1162
  We provide two inference modes: chat and streaming.
1163
 
1164
  #### Chat Inference <!-- omit in toc -->
@@ -1299,10 +1314,10 @@ else:
1299
  </details>
1300
 
1301
 
1302
- ### Simplex Realtime Speech Conversation Mode <!-- omit in toc -->
1303
 
1304
  <details>
1305
- <summary>Click to show simplex mode realtime speech conversation API usage.</summary>
1306
 
1307
  First, make sure you have all dependencies, especially `"minicpmo-utils[all]>=1.0.5"`:
1308
  ```bash
@@ -1427,11 +1442,12 @@ else:
1427
 
1428
  #### Speech Conversation as a Versatile and Vibe AI Assistant <!-- omit in toc -->
1429
 
1430
- Built on carefully designed post-training data and professional voice-actor recordings, `MiniCPM-o-4.5` can also function as an AI voice assistant. It delivers high-quality spoken interaction out of the box. It produces a sweet and expressive voice with natural prosody, including appropriate rhythm, stress, and pauses, giving a strong sense of liveliness in casual conversation. It also supports storytelling and narrative speech with coherent and engaging delivery. Moreover, it enables advanced voice instruction control. like emotional tone, word-level emphasis.
1431
 
1432
  <details>
1433
  <summary>Click to show AI assistant conversation code.</summary>
1434
 
1435
  ```python
1436
  import librosa
1437
 
@@ -1465,11 +1481,11 @@ sys_msg = {
1465
 
1466
  #### General Speech Conversation with Custom Voice and Custom System Profile <!-- omit in toc -->
1467
 
1468
- MiniCPM-o-4.5 can role-play as a specific character based on an audio prompt and text profile prompt. It mimics the character's voice and adopts their language style in text responses. It also follows profile defined in text profile. In this mode, MiniCPM-o-4.5 sounds **more natural and human-like**.
1469
-
1470
  <details>
1471
  <summary>Click to show custom voice conversation code.</summary>
1472
 
1473
  ```python
1474
  import librosa
1475
 
@@ -1527,11 +1543,12 @@ sys_msg = {
1527
 
1528
  #### Zero-shot Text-to-speech (TTS) <!-- omit in toc -->
1529
 
1530
- `MiniCPM-o-4.5` supports zero-shot text-to-speech (TTS). In this mode, the model functions as a highly-natural TTS system that can replicate a reference voice.
1531
 
1532
  <details>
1533
  <summary>Click to show TTS code.</summary>
1534
 
1535
  ```python
1536
  import librosa
1537
 
@@ -1580,11 +1597,11 @@ res = model.chat(
1580
 
1581
  #### Mimick <!-- omit in toc -->
1582
 
1583
- The `Mimick` task evaluates a model's end-to-end speech modeling capability. The model takes audio input, transcribes it, and reconstructs the original audio with high fidelity, preserving detailed acoustic, paralinguistic, and semantic information. Higher similarity between the reconstructed and original audio indicates stronger end-to-end speech modeling capability.
1584
-
1585
  <details>
1586
  <summary>Click to show mimick code.</summary>
1587
 
1588
  ```python
1589
  import librosa
1590
 
@@ -1618,6 +1635,10 @@ res = model.chat(
1618
 
1619
  #### Addressing Various Audio Understanding Tasks <!-- omit in toc -->
1620
 
1621
  `MiniCPM-o-4.5` can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
1622
 
1623
  For audio-to-text tasks, you can use the following prompts:
@@ -1628,9 +1649,6 @@ For audio-to-text tasks, you can use the following prompts:
1628
  - General Audio Caption: `Summarize the main content of the audio.`
1629
  - Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`
1630
 
1631
- <details>
1632
- <summary>Click to show audio understanding code.</summary>
1633
-
1634
  ```python
1635
  import librosa
1636
 
@@ -1688,11 +1706,7 @@ image = Image.open("assets/fossil.png").convert("RGB")
1688
  question = "What is in the image?"
1689
  msgs = [{"role": "user", "content": [image, question]}]
1690
 
1691
- enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
1692
- stream=False # If `stream=True`, return string generator
1693
-
1694
- ## default max_slice_nums=9, set max_slice_nums=25 for pdf parse task
1695
- res = model.chat(msgs=msgs, use_tts_template=False, enable_thinking=enable_thinking, stream=stream)
1696
  print(res)
1697
  ```
1698
 
@@ -1827,6 +1841,36 @@ msgs = [
1827
  </details>
1828
 
1829
 
1830
  ## FlagOS
1831
  <details>
1832
  <summary>Click to show FlagOS Usage details.</summary>
@@ -1916,10 +1960,17 @@ FlagRelease is a platform developed by the FlagOS team for automatic migration,
1916
 
1917
  </details>
1918
 
1919
 
1920
  ## MiniCPM-V & o Cookbook
1921
 
1922
- Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
1923
 
1924
  **Easy Usage Documentation**
1925
 
@@ -1930,8 +1981,8 @@ All features are displayed at a glance, making it easy for you to quickly find e
1930
 
1931
  We support a wide range of users, from individuals to enterprises and researchers.
1932
 
1933
- * **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
1934
- * **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
1935
  * **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
1936
 
1937
  **Versatile Deployment Scenarios**
@@ -1947,8 +1998,8 @@ Our ecosystem delivers optimal solution for a variety of hardware environments a
1947
  * The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
1948
 
1949
  #### Statement
1950
- * As an LMM, MiniCPM-o 4.5 generates contents by learning a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers
1951
- * We will not be liable for any problems arising from the use of the MinCPM-o models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
1952
 
1953
 
1954
  ## Key Techniques and Other Multimodal Projects <!-- omit in toc -->
 
35
  As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This **allows MiniCPM-o 4.5 to see, listen, and speak simultaneously**, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform **proactive interaction**, such as initiating reminders or comments based on its continuous understanding of the live scene.
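The non-blocking behavior described above can be pictured as two concurrent loops that never wait on each other: one ingests frames while the other emits responses. A minimal asyncio sketch with toy stand-ins (not the model's actual API):

```python
import asyncio

async def ingest(frames, buffer):
    # Consume incoming video/audio frames without waiting for output.
    for frame in frames:
        buffer.append(frame)
        await asyncio.sleep(0)  # yield control; stand-in for real-time arrival

async def respond(buffer, replies):
    # Produce output concurrently; never blocks the input loop.
    for _ in range(3):
        replies.append(f"seen {len(buffer)} frames")
        await asyncio.sleep(0)

async def main():
    buffer, replies = [], []
    # Both loops run interleaved: input keeps flowing while output is produced.
    await asyncio.gather(ingest(range(10), buffer), respond(buffer, replies))
    return replies

replies = asyncio.run(main())
```

The point of the sketch is only the scheduling shape: ingestion and generation make progress in the same event loop without mutual blocking.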
36
 
37
  - 💪 **Strong OCR Capability, Efficiency and Others.**
38
+ Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 4.5 can process **high-resolution images** (up to 1.8 million pixels) and **high-FPS videos** (up to 10fps) in any aspect ratio efficiently. It achieves **state-of-the-art performance for end-to-end English document parsing** on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features **trustworthy behaviors**, matching Gemini 2.5 Flash on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages.
39
 
40
  - 💫 **Easy Usage.**
41
+ MiniCPM-o 4.5 can be easily used in various ways. **Basic usage, recommended for 100% precision:** PyTorch inference on an Nvidia GPU. **Other end-side adaptations** include (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput and memory-efficient inference, and (4) FlagOS support for the unified multi-chip backend plugin. **We also open-sourced web demos** that **enable the full-duplex multimodal live streaming experience on local devices** such as GPUs and PCs (e.g., a MacBook).
42
 
43
  **Model Architecture.**
44
+ - **End-to-end Omni-modal Architecture.** The modality encoders/decoders and LLM are densely connected via hidden states in an end-to-end fashion. This enables better information flow and control, and also facilitates full exploitation of rich multimodal knowledge during training.
45
+ - **Full-Duplex Omni-modal Live Streaming Mechanism.** (1) We turn the offline modality encoders/decoders into online, full-duplex ones for streaming inputs/outputs. The speech token decoder models text and speech tokens in an interleaved fashion to support full-duplex speech generation (i.e., it synchronizes promptly with new input). This also facilitates more stable long speech generation (e.g., > 1 min).
46
+ (2) **We sync all input and output streams on a millisecond-level timeline**; they are jointly modeled by a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides the parallel omni-modality streams into sequential info groups within small periodic time slices.
47
+ - **Proactive Interaction Mechanism.** The LLM continuously monitors the input video and audio streams and decides, at a frequency of 1 Hz, whether to speak. This high decision-making frequency, together with the full-duplex design, is crucial to enabling the proactive interaction capability.
48
+ - **Configurable Speech Modeling Design.** We inherit the multimodal system prompt design of MiniCPM-o 2.6, which includes a traditional text system prompt and a new audio system prompt that determines the assistant's voice. This enables voice cloning and role-play at inference time for speech conversation.
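The time-division multiplexing idea in the notes above can be sketched in a few lines: parallel modality streams are cut into small periodic time slices and serialized into one sequence of per-slice groups. This is a toy illustration of the mechanism under stated assumptions (made-up stream contents and slice sizes), not the actual implementation:

```python
def tdm_multiplex(streams, slice_ms, total_ms):
    """Serialize parallel modality streams into periodic per-slice groups.

    streams: dict mapping modality name -> list of (timestamp_ms, item).
    Returns a flat list of (slice_start_ms, modality, items) groups, ordered
    by time slice first and modality second, forming one sequential input.
    """
    sequence = []
    for start in range(0, total_ms, slice_ms):
        for modality, items in streams.items():
            group = [it for t, it in items if start <= t < start + slice_ms]
            if group:
                sequence.append((start, modality, group))
    return sequence

# Toy streams: video frames at 10 fps (every 100 ms), audio chunks every 50 ms.
streams = {
    "video": [(t, f"frame@{t}") for t in range(0, 300, 100)],
    "audio": [(t, f"chunk@{t}") for t in range(0, 300, 50)],
}
seq = tdm_multiplex(streams, slice_ms=100, total_ms=300)
```

Each 100 ms slice contributes one video group and one audio group, so the parallel streams become a single interleaved sequence the backbone can process in order.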
49
+
50
+
51
 
52
  <div align="center">
53
  <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-o-45-framework.png" width=100%>
 
67
  <div align="center">
68
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm_o_45_main_exp_table.png" width=90%>
69
  </div>
70
+ <strong>Note</strong>: Scores marked with ∗ are from our evaluation; others are cited from referenced reports. n/a indicates that the model does not support the corresponding modality. All results are reported in instruct mode/variant.
71
+
72
+ &emsp;
73
+ <br>
74
 
75
  <details>
76
  <summary>Click to view visual understanding results.</summary>
 
663
  </details>
664
 
665
  <details>
666
+ <summary>Click to view omni half-duplex results.</summary>
667
 
668
+ **Omni Half-Duplex**
669
  <div align="center">
670
  <table style="margin: 0px auto;">
671
  <tr>
 
972
  <a href="https://www.youtube.com/watch?v=6UzC-O1Q-1U"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo4_5/video_play.png" width=70%></a>
973
  </div>
974
 
975
+ ### Examples: Omnimodal Full-Duplex Conversation <!-- omit in toc -->
976
+
977
+ > [!NOTE]
978
+ > For detailed omnimodal full-duplex conversation examples, refer to [Omni Full-Duplex Casebook](https://openbmb.github.io/minicpm-o-4_5-omni/)
979
+
980
  ### Examples: 🎙️ Speech Conversation <!-- omit in toc -->
981
 
982
  > [!NOTE]
983
  > For detailed speech conversation examples, refer to [Audio Demo Page](https://openbmb.github.io/minicpm-o-4_5/)
984
 
985
+ Half-duplex speech conversation with custom reference audio and character prompts.
986
 
987
  <details open>
988
  <summary>🚀 <b>Elon Musk</b> - Voice Roleplay (EN)</summary>
 
1012
  </div>
1013
 
1014
 
1015
+ ## Offline Inference Examples with Transformers
1016
 
1017
  Inference using Hugging Face Transformers on NVIDIA GPUs. Please ensure `transformers==4.51.0` is installed, as other versions may have compatibility issues (under investigation). Requirements tested on Python 3.10:
1018
 
 
1072
  # Initialize TTS for audio output
1073
  model.init_tts()
1074
 
1075
+ # Convert half-duplex model to duplex mode
1076
  duplex_model = model.as_duplex()
1077
 
1078
+ # Convert duplex model back to half-duplex mode
1079
+ model = duplex_model.as_simplex(reset_session=True)
1080
  ```
1081
 
1082
 
 
1173
  ```
1174
 
1175
 
1176
+ ### Half-Duplex Omni Mode <!-- omit in toc -->
1177
  We provide two inference modes: chat and streaming.
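The two modes differ in how output is returned: chat inference yields the complete response at once, while streaming returns a string generator to consume incrementally. A sketch of the two calling patterns, using a hypothetical `fake_chat` stand-in for `model.chat` (which, per this README, takes a `stream` flag):

```python
def fake_chat(msgs, stream=False):
    # Stand-in for model.chat(...): with stream=True it returns a
    # string generator; otherwise the complete response string.
    chunks = ["Hello", ", ", "world", "!"]
    if stream:
        return (c for c in chunks)  # consume incrementally as chunks arrive
    return "".join(chunks)          # full response at once

msgs = [{"role": "user", "content": ["Say hello."]}]

full = fake_chat(msgs)                        # chat inference
pieces = list(fake_chat(msgs, stream=True))   # streaming inference
assert full == "".join(pieces)
```

Streaming is preferable for realtime UIs, where partial text can be shown (or spoken) while generation continues.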
1178
 
1179
  #### Chat Inference <!-- omit in toc -->
 
1314
  </details>
1315
 
1316
 
1317
+ ### Half-Duplex Realtime Speech Conversation Mode <!-- omit in toc -->
1318
 
1319
  <details>
1320
+ <summary>Click to show half-duplex mode realtime speech conversation API usage.</summary>
1321
 
1322
  First, make sure you have all dependencies, especially `"minicpmo-utils[all]>=1.0.5"`:
1323
  ```bash
 
1442
 
1443
  #### Speech Conversation as a Versatile and Vibe AI Assistant <!-- omit in toc -->
1444
 
 
1445
 
1446
  <details>
1447
  <summary>Click to show AI assistant conversation code.</summary>
1448
 
1449
+ Built on carefully designed post-training data and professional voice-actor recordings, `MiniCPM-o-4.5` can also function as an AI voice assistant. It delivers high-quality spoken interaction out of the box, producing a sweet and expressive voice with natural prosody, including appropriate rhythm, stress, and pauses, which gives a strong sense of liveliness in casual conversation. It also supports storytelling and narrative speech with coherent and engaging delivery. Moreover, it enables advanced voice instruction control, such as emotional tone and word-level emphasis.
1450
+
1451
  ```python
1452
  import librosa
1453
 
 
1481
 
1482
  #### General Speech Conversation with Custom Voice and Custom System Profile <!-- omit in toc -->
1483
 
1484
  <details>
1485
  <summary>Click to show custom voice conversation code.</summary>
1486
 
1487
+ MiniCPM-o-4.5 can role-play as a specific character based on an audio prompt and a text profile prompt. It mimics the character's voice and adopts their language style in text responses, and it follows the persona defined in the text profile. In this mode, MiniCPM-o-4.5 sounds **more natural and human-like**.
1488
+
1489
  ```python
1490
  import librosa
1491
 
 
1543
 
1544
  #### Zero-shot Text-to-speech (TTS) <!-- omit in toc -->
1545
 
 
1546
 
1547
  <details>
1548
  <summary>Click to show TTS code.</summary>
1549
 
1550
+ `MiniCPM-o-4.5` supports zero-shot text-to-speech (TTS). In this mode, the model functions as a highly-natural TTS system that can replicate a reference voice.
1551
+
1552
  ```python
1553
  import librosa
1554
 
 
1597
 
1598
  #### Mimick <!-- omit in toc -->
1599
 
1600
  <details>
1601
  <summary>Click to show mimick code.</summary>
1602
 
1603
+ The `Mimick` task evaluates a model's end-to-end speech modeling capability. The model takes audio input, transcribes it, and reconstructs the original audio with high fidelity, preserving detailed acoustic, paralinguistic, and semantic information. Higher similarity between the reconstructed and original audio indicates stronger end-to-end speech modeling capability.
1604
+
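As a rough illustration of "similarity" here, one could compare the average magnitude spectra of the original and reconstructed signals. This is a toy proxy metric of our own for intuition, not the actual measure used by any benchmark:

```python
import numpy as np

def spectral_similarity(a, b, frame=512, hop=256):
    """Cosine similarity between average magnitude spectra of two signals.

    A crude proxy for how close a reconstruction is to the original audio.
    """
    def avg_spectrum(x):
        # Frame the signal, apply a Hann window, and average |FFT| over frames.
        frames = [x[i:i + frame] for i in range(0, len(x) - frame + 1, hop)]
        mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame), axis=1))
        return mags.mean(axis=0)

    sa, sb = avg_spectrum(a), avg_spectrum(b)
    return float(np.dot(sa, sb) / (np.linalg.norm(sa) * np.linalg.norm(sb)))

# Identical signals score 1.0; a different pitch scores noticeably lower.
t = np.linspace(0, 1, 16000, endpoint=False)
orig = np.sin(2 * np.pi * 440 * t)
same = spectral_similarity(orig, orig)
diff = spectral_similarity(orig, np.sin(2 * np.pi * 880 * t))
```

A faithful mimick would score near 1.0 against the original; losses of acoustic or paralinguistic detail pull the score down.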
1605
  ```python
1606
  import librosa
1607
 
 
1635
 
1636
  #### Addressing Various Audio Understanding Tasks <!-- omit in toc -->
1637
 
1638
+
1639
+ <details>
1640
+ <summary>Click to show audio understanding code.</summary>
1641
+
1642
  `MiniCPM-o-4.5` can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
1643
 
1644
  For audio-to-text tasks, you can use the following prompts:
 
1649
  - General Audio Caption: `Summarize the main content of the audio.`
1650
  - Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`
1651
 
1652
  ```python
1653
  import librosa
1654
 
 
1706
  question = "What is in the image?"
1707
  msgs = [{"role": "user", "content": [image, question]}]
1708
 
1709
+ res = model.chat(msgs=msgs, use_tts_template=False)
1710
  print(res)
1711
  ```
1712
 
 
1841
  </details>
1842
 
1843
 
1844
+ ## Deploy a Realtime Web Demo on Your Own Device
1845
+
1846
+ ### Option A (Recommended): **PyTorch Inference with Nvidia GPU** for 100% model precision with no performance degradation.
1847
+
1848
+ We provide a PyTorch-based [simplified yet fully functional web demo](https://github.com/OpenBMB/minicpm-o-4_5-pytorch-simple-demo) that boosts model inference performance and supports:
1849
+
1850
+ - full-duplex omnimodal live streaming
1851
+ - full-duplex speech live streaming
1852
+ - half-duplex speech live streaming (under development)
1853
+ - turn-based chat conversation
1854
+ - customizable system prompts
1855
+ - customizable reference audio
1856
+ - simple and readable codebase for continual development
1857
+ - serving as an API backend for third-party applications
1858
+
1859
+ Requirements:
1860
+ - Nvidia GPU with at least 28GB GPU memory. *We are working on optimizing the model for lower GPU memory usage.*
1861
+
1862
+ ### Option B: **llama.cpp-omni** for end-side inference on PCs such as Macs and other low-resource devices.
1863
+
1864
+ With a fully C++ implementation of `MiniCPM-o 4.5` and quantized weights, `llama.cpp-omni` supports:
1865
+ - half-duplex speech realtime conversation
1866
+ - full-duplex omnimodal live streaming
1867
+
1868
+ We provide [ready-to-run guidance](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/WebRTC_Demo/README.md) to experience low-latency full-duplex communication directly on your own Mac using our new official Docker image.
1869
+
1870
+ Requirements:
1871
+ - For half-duplex speech realtime conversation: Apple M3/M4/M5 chip with at least 16GB RAM or low-resource Nvidia GPU with at least 12GB GPU memory
1872
+ - For full-duplex omnimodal live streaming: Apple M4 Max chip with at least 24GB RAM or low-resource Nvidia GPU with at least 12GB GPU memory
1873
+
1874
  ## FlagOS
1875
  <details>
1876
  <summary>Click to show FlagOS Usage details.</summary>
 
1960
 
1961
  </details>
1962
 
1963
+ ### vLLM, SGLang, llama.cpp, Ollama
1964
+
1965
+ We support inference with vLLM, SGLang, llama.cpp and Ollama. Refer to our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-Cookbook) for more details.
1966
+
1967
+ ### LLaMA-Factory, SWIFT
1968
+
1969
+ We support fine-tuning with LLaMA-Factory and SWIFT. Refer to our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-Cookbook) for more details.
1970
 
1971
  ## MiniCPM-V & o Cookbook
1972
 
1973
+ Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
1974
 
1975
  **Easy Usage Documentation**
1976
 
 
1981
 
1982
  We support a wide range of users, from individuals to enterprises and researchers.
1983
 
1984
+ * **Individuals**: Enjoy effortless inference using Ollama ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-o4_5_ollama.md)) and Llama.cpp ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-o4_5_llamacpp.md)) with minimal setup.
1985
+ * **Enterprises**: Achieve high-throughput, scalable performance with vLLM ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-o4_5_vllm.md)) and SGLang ([V4](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md), [o4.5](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-o4_5_sglang.md)).
1986
  * **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
1987
 
1988
  **Versatile Deployment Scenarios**
 
1998
  * The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
1999
 
2000
  #### Statement
2001
+ * As MLLMs, MiniCPM-o/V models generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o/V models does not represent the views and positions of the model developers.
2002
+ * We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
2003
 
2004
 
2005
  ## Key Techniques and Other Multimodal Projects <!-- omit in toc -->