MiniCPM-o 4.5 – MLX 4-bit Quantized (Full Multimodal)

4-bit quantized MLX conversion of openbmb/MiniCPM-o-4_5 for fast inference on Apple Silicon (M1/M2/M3/M4).

Includes all modalities: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and full duplex streaming (real-time screen + audio capture).

Model Details

Base model    openbmb/MiniCPM-o-4_5
Architecture  SigLIP2 (27L) + Perceiver Resampler + Whisper Encoder (24L) + Qwen3 LLM (36L) + TTS Llama (20L)
Parameters    ~8B
Quantization  4-bit (6.031 effective bits); LLM quantized, all encoders full precision
Size on disk  ~7.0 GB
Weight keys   1925 total (LLM: 907, Vision: 437, Resampler: 17, Audio: 367, Audio Proj: 4, TTS: 193)
Framework     MLX via mlx-vlm

Architecture

Audio (.wav) --> Mel Spectrogram --> WhisperEncoder (24L, 1024d) --> AudioProjection --> AvgPool(5) --> audio tokens
Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) --> image tokens (64)
Text --> Tokenizer --> text tokens

audio + image + text tokens --> Qwen3 LLM (36L) --> Text Output
                                      |
                                      +--> LLM hidden states --> TTSProjector --> TTS Llama (20L) --> Audio Tokens

Performance (M4 Pro, 24 GB RAM)

Mode          Prompt Processing  Generation  Peak Memory
Text-only     ~60 tok/s          ~55 tok/s   ~7.1 GB
Image + Text  ~150 tok/s         ~49 tok/s   ~8.3 GB
Audio + Text  ~85 tok/s          ~55 tok/s   ~8.4 GB

Capabilities

  • Vision: Image understanding, OCR, chart/diagram analysis, math solving, visual reasoning
  • Audio input: Speech recognition, audio description, sound classification
  • TTS output: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
  • Multilingual: English, Chinese, Indonesian, French, German, etc.
  • Full duplex streaming: Real-time screen capture + system audio analysis with continuous LLM output

Requirements

  • Apple Silicon Mac (M1 or later)
  • Python 3.10+
  • ~10 GB free RAM (for full multimodal)

Install the core dependencies:

pip install mlx-vlm torch transformers Pillow soundfile

Optional dependencies:

pip install librosa                # Audio resampling (if input isn't 16kHz)
pip install minicpmo-utils[all]    # Token2wav vocoder for TTS output
pip install mss sounddevice        # For streaming mode (screen + audio capture)

For system audio capture on macOS (streaming mode):

brew install blackhole-2ch

Then open Audio MIDI Setup > create a Multi-Output Device combining your speakers + BlackHole 2ch.

Quick Start

Chat Script

A standalone chat_minicpmo.py script is included:

# Image input
python chat_minicpmo.py photo.jpg -p "What's in this image?"

# Audio input
python chat_minicpmo.py --audio speech.wav -p "What is being said?"

# Audio description
python chat_minicpmo.py --audio sound.wav -p "Describe this audio."

# Text-only
python chat_minicpmo.py -p "Explain quantum computing briefly."

# Interactive mode
python chat_minicpmo.py

# Interactive with pre-loaded audio
python chat_minicpmo.py --audio recording.wav

# TTS output (requires minicpmo-utils)
python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav

Interactive commands: /image <path> | /audio <path> | /live | /clear | /quit

Streaming Mode (Full Duplex)

Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.

Use cases: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.

Architecture

[Screen Capture 1fps] ---+
                         +--> ChunkSynchronizer --> Streaming Whisper --> LLM (KV cache) --> Text Output
[System Audio 16kHz] ----+          ^                      ^                    ^                 |
                              MelProcessor          Whisper KV cache       LLM KV cache           v
                                                                                        TTS Playback (optional)
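The ChunkSynchronizer pairs each one-second audio chunk with the most recent screen frame. A minimal sketch of that pairing logic (class and method names here are illustrative; the shipped streaming.py is authoritative):

```python
import queue
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Chunk:
    frame: Optional[Any]   # latest screenshot, or None if none arrived yet
    audio: Any             # one second of 16kHz samples

class ChunkSynchronizerSketch:
    """Pairs the newest video frame with each 1-second audio chunk."""

    def __init__(self) -> None:
        self._frames: queue.Queue = queue.Queue()
        self._audio: queue.Queue = queue.Queue()

    def push_frame(self, frame: Any) -> None:
        self._frames.put(frame)

    def push_audio(self, samples: Any) -> None:
        self._audio.put(samples)

    def next_chunk(self, timeout: float = 2.0) -> Chunk:
        samples = self._audio.get(timeout=timeout)  # audio is the clock
        frame = None
        while not self._frames.empty():             # drain queue, keep newest
            frame = self._frames.get_nowait()
        return Chunk(frame=frame, audio=samples)
```

Audio drives the clock, so a slow screen grab delays nothing: the newest frame simply wins.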

Quick Start

# Full duplex streaming (captures primary monitor + system audio)
python chat_minicpmo.py --live

# Capture specific screen region
python chat_minicpmo.py --live --capture-region 0,0,1920,1080

# Use mic instead of system audio
python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"

# With TTS output (speaks responses aloud)
python chat_minicpmo.py --live --tts

# Or start from interactive mode
python chat_minicpmo.py
> /live

Press Ctrl+C to stop streaming.

CLI Options

Flag              Default          Description
--live            (off)            Enable full duplex streaming mode
--capture-region  Primary monitor  Screen region as x,y,w,h
--audio-device    BlackHole        Audio input device name
--tts             Off              Enable TTS speech output
--temp            0.0              Sampling temperature
--max-tokens      512              Max tokens per chunk response

How It Works

  1. Screen capture (mss): Grabs a screenshot at 1 fps, resizes to 448x448, feeds through SigLIP2 vision encoder + Perceiver Resampler (64 tokens).

  2. Audio capture (sounddevice): Records system audio via BlackHole virtual device at 16kHz. Accumulates 1-second chunks.

  3. Streaming Whisper encoder: Processes audio incrementally using KV cache โ€” no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries. Auto-resets when reaching 1500 positions.

  4. LLM with KV cache continuation: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.

  5. Text generation: When the model has something to say, it generates text autoregressively from the cached state. Stops at <|im_end|> or mode-switch tokens.

  6. TTS playback (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through speakers using Token2wav.
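The per-chunk control flow in the steps above can be condensed into a small loop. Here prefill and generate stand in for the real cache-manipulating methods (hypothetical signatures; DuplexGenerator in streaming.py owns the actual MLX KV caches):

```python
def duplex_loop(chunks, prefill, generate, max_tokens=512):
    """Sketch of the per-chunk duplex loop.

    prefill(chunk): extend the running LLM KV cache with the chunk's
        vision + audio embeddings; returns "listen" or "speak".
    generate(n): continue autoregressive generation from the cached state.
    """
    outputs = []
    for chunk in chunks:
        mode = prefill(chunk)        # step 4: KV cache continuation
        if mode == "speak":          # model decided it has something to say
            outputs.append(generate(max_tokens))  # step 5: text generation
    return outputs
```

Because the cache persists across iterations, each chunk costs only its own prefill plus any generation, never a full re-encode of history.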

Output Format

[1] The video shows a person speaking in Indonesian about cooking techniques.
  >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB
[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
  >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
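The >> status line is plain key=value pairs, so downstream tooling can parse it in a few lines (a sketch, assuming the format shown above):

```python
def parse_status(line: str) -> dict:
    """Parse a status line like '>> chunk=2 mode=listen cache=284tok ...'."""
    fields = line.lstrip("> ").split()          # drop the '>>' prefix, split on spaces
    return dict(f.split("=", 1) for f in fields)

s = parse_status(">> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB")
print(s["mode"], s["latency"])   # listen 2100ms
```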

System Audio Setup (macOS)

To capture system audio (what's playing through your speakers), you need BlackHole:

  1. Install: brew install blackhole-2ch
  2. Open Audio MIDI Setup (Spotlight > "Audio MIDI Setup")
  3. Click + > Create Multi-Output Device
  4. Check both MacBook Pro Speakers and BlackHole 2ch
  5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
  6. Run streaming with default --audio-device BlackHole

Without BlackHole, use your mic: --audio-device "MacBook Pro Microphone"

Memory & Latency Budget

Component                    Memory    Latency
Model weights                ~7.0 GB   -
LLM KV cache (4096 tok)      ~1.2 GB   -
Whisper KV cache (1500 pos)  ~0.3 GB   -
Screen capture               -         ~10ms
Mel extraction               -         ~50ms
Whisper streaming encode     -         ~200ms
Vision encode                -         ~150ms
LLM prefill (chunk)          -         ~300ms
LLM generate (50 tok)        -         ~1s
Total peak                   ~9.0 GB   ~2.2s/chunk

Files

File              Description
streaming.py      ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback
chat_minicpmo.py  CLI with --live flag and /live interactive command

Python API

from mlx_vlm import load
from mlx_vlm.generate import generate_step
import mlx.core as mx

model, processor = load("andrevp/MiniCPM-o-4_5-MLX-4bit", trust_remote_code=True)

# Text-only
text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])

tokens = []
for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
    tok_val = int(token)
    if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
        break                      # stop token reached
    tokens.append(tok_val)
    if len(tokens) >= 256:         # generate_step yields indefinitely; cap it
        break

print(processor.tokenizer.decode(tokens, skip_special_tokens=True))

Audio Input (Python API)

import soundfile as sf
import mlx.core as mx
from transformers import WhisperFeatureExtractor

# Load and preprocess audio
audio, sr = sf.read("speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo to mono

# Extract mel spectrogram
fe = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000, n_fft=400, hop_length=160)
inputs = fe(audio, sampling_rate=16000, return_tensors="pt", padding="max_length", return_attention_mask=True)
mel = inputs["input_features"]
actual_len = inputs["attention_mask"].sum(dim=1)
mel_trimmed = mel[:, :, :int(actual_len[0])]

# Convert to MLX and run through audio encoder
audio_features = mx.array(mel_trimmed.numpy())  # (1, 80, frames)

# Pass audio_features and audio_bound to generate_step via kwargs
# See chat_minicpmo.py for the full pipeline

Component Details

Audio Encoder (Whisper)

  • 24-layer Whisper encoder (1024d, 16 heads, 4096 FFN)
  • Conv1d feature extraction: mel (80 bins) -> conv1 (stride=1) -> conv2 (stride=2)
  • Learned positional embeddings (max 1500 positions)
  • Audio projection: 2-layer MLP (1024 -> 4096) with ReLU
  • Average pooling with stride 5
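This downsampling chain fixes the audio token rate. A quick back-of-the-envelope check, assuming the standard Whisper frontend (16kHz input, hop length 160, hence 100 mel frames per second):

```python
def audio_tokens_per_second(sample_rate=16_000, hop_length=160,
                            conv_stride=2, pool_stride=5):
    """Mel frames/s -> encoder positions/s -> pooled LLM tokens/s."""
    mel_frames = sample_rate // hop_length   # 100 mel frames per second
    positions = mel_frames // conv_stride    # conv2 stride 2 -> 50 positions/s
    return positions // pool_stride          # AvgPool(5) -> 10 LLM tokens/s

print(audio_tokens_per_second())   # 10
```

At 50 encoder positions per second, the 1500-position limit corresponds to roughly 30 seconds of audio before the streaming encoder resets.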

TTS Model (CosyVoice2 Llama)

  • 20-layer Llama backbone (768d, 12 heads, 3072 FFN)
  • Text embedding: 152064 tokens -> 768d
  • Audio codebook: 6562 tokens (1 VQ codebook)
  • Semantic projector: LLM hidden (4096d) -> TTS hidden (768d)
  • Speaker projector: LLM hidden (4096d) -> speaker embedding (768d)
  • Autoregressive generation with temperature + top-p sampling

Audio Special Tokens

Token            ID      Purpose
<|audio_start|>  151697  Start of audio placeholder
<|audio|>        151698  Audio token
<|audio_end|>    151699  End of audio placeholder
<|spk_bos|>      151700  Speaker embedding start
<|spk_eos|>      151702  Speaker embedding end
<|tts_bos|>      151703  TTS generation start
<|tts_eos|>      151704  TTS generation end
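These tokens suggest how an audio placeholder is laid out in the prompt. The sketch below is an illustration only: the processor's chat template is authoritative, and the 10 tokens-per-second rate is an assumption based on the AvgPool(5) stride:

```python
AUDIO_START, AUDIO, AUDIO_END = "<|audio_start|>", "<|audio|>", "<|audio_end|>"

def audio_placeholder(seconds: float, tokens_per_second: int = 10) -> str:
    """Illustrative placeholder: one <|audio|> slot per pooled audio token."""
    n = int(seconds * tokens_per_second)
    return AUDIO_START + AUDIO * n + AUDIO_END

print(audio_placeholder(2.0).count("<|audio|>"))   # 20 slots for 2s of audio
```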

Quantization Details

Component Keys Precision Notes
Qwen3 LLM (36L) 907 4-bit (group_size=64) Main language model
SigLIP2 Vision (27L) 437 Full precision Vision encoder
Perceiver Resampler 17 Full precision Cross-attention resampler
Whisper Audio (24L) 367 Full precision Audio encoder
Audio Projection 4 Full precision 2-layer MLP
TTS Llama (20L) 193 Full precision Speech synthesis backbone
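A selective scheme like this is what mlx.nn.quantize's class_predicate hook is for. The sketch below shows a predicate that quantizes only LLM Linear layers; the module-path prefixes (vpm., apm., resampler., tts.) are assumptions about this checkpoint's naming, not verified keys:

```python
def is_llm_linear(path: str, module_type: str = "Linear") -> bool:
    """Return True only for Linear layers inside the Qwen3 LLM.

    Encoders, resampler, projections, and the TTS backbone stay in full
    precision. The prefixes below are assumed, not read from the checkpoint.
    """
    full_precision_prefixes = ("vpm.", "resampler.", "apm.",
                               "audio_projection", "tts.")
    if module_type != "Linear":
        return False
    return not path.startswith(full_precision_prefixes)

# Intended use (requires mlx on Apple Silicon):
# import mlx.nn as nn
# nn.quantize(model, group_size=64, bits=4,
#             class_predicate=lambda p, m: is_llm_linear(p, type(m).__name__))
```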

Notes

  • Audio input requires 16kHz mono WAV. Install librosa for automatic resampling from other sample rates.
  • TTS output generates audio token IDs. Converting to waveform requires the Token2wav vocoder from minicpmo-utils[all].
  • Processes one image per turn, one audio clip per turn.
  • Quantization may slightly reduce output quality compared to the full-precision model.

License

This model is released under the Apache-2.0 license, following the original openbmb/MiniCPM-o-4_5 license.

See the original license for full terms.

Disclaimer

As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora; it cannot comprehend or express personal opinions, and it does not make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks to public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
