# MiniCPM-o 4.5 — MLX 4-bit Quantized (Full Multimodal)

4-bit quantized MLX conversion of openbmb/MiniCPM-o-4_5 for fast inference on Apple Silicon (M1/M2/M3/M4).

Includes all modalities: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and full duplex streaming (real-time screen + audio capture).
## Model Details

| | |
|---|---|
| Base model | openbmb/MiniCPM-o-4_5 |
| Architecture | SigLIP2 (27L) + Perceiver Resampler + Whisper Encoder (24L) + Qwen3 LLM (36L) + TTS Llama (20L) |
| Parameters | ~8B |
| Quantization | 4-bit (6.031 effective bits) — LLM quantized, all encoders full precision |
| Size on disk | ~7.0 GB |
| Weight keys | 1925 total (LLM: 907, Vision: 437, Resampler: 17, Audio: 367, Audio Proj: 4, TTS: 193) |
| Framework | MLX via mlx-vlm |
## Architecture

```
Audio (.wav) --> Mel Spectrogram --> WhisperEncoder (24L, 1024d) --> AudioProjection --> AvgPool(5) --\
                                                                                                      |
Text --> Tokenizer ---------------------------------------------> Qwen3 LLM (36L) --> Text Output <---/
                                                                      ^   |
Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) ---------/   |
                                                                          v
                           LLM hidden states --> TTSProjector --> TTS Llama (20L) --> Audio Tokens
```
## Performance (M4 Pro, 24 GB RAM)
| Mode | Prompt Processing | Generation | Peak Memory |
|---|---|---|---|
| Text-only | ~60 tok/s | ~55 tok/s | ~7.1 GB |
| Image + Text | ~150 tok/s | ~49 tok/s | ~8.3 GB |
| Audio + Text | ~85 tok/s | ~55 tok/s | ~8.4 GB |
## Capabilities
- Vision: Image understanding, OCR, chart/diagram analysis, math solving, visual reasoning
- Audio input: Speech recognition, audio description, sound classification
- TTS output: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
- Multilingual: English, Chinese, Indonesian, French, German, etc.
- Full duplex streaming: Real-time screen capture + system audio analysis with continuous LLM output
## Requirements
- Apple Silicon Mac (M1 or later)
- Python 3.10+
- ~10 GB free RAM (for full multimodal)
```bash
pip install mlx-vlm torch transformers Pillow soundfile
```

Optional dependencies:

```bash
pip install librosa              # Audio resampling (if input isn't 16kHz)
pip install minicpmo-utils[all]  # Token2wav vocoder for TTS output
pip install mss sounddevice      # For streaming mode (screen + audio capture)
```

For system audio capture on macOS (streaming mode):

```bash
brew install blackhole-2ch
```

Then open Audio MIDI Setup and create a Multi-Output Device combining your speakers + BlackHole 2ch.
## Quick Start

### Chat Script

A standalone `chat_minicpmo.py` script is included:

```bash
# Image input
python chat_minicpmo.py photo.jpg -p "What's in this image?"

# Audio input
python chat_minicpmo.py --audio speech.wav -p "What is being said?"

# Audio description
python chat_minicpmo.py --audio sound.wav -p "Describe this audio."

# Text-only
python chat_minicpmo.py -p "Explain quantum computing briefly."

# Interactive mode
python chat_minicpmo.py

# Interactive with pre-loaded audio
python chat_minicpmo.py --audio recording.wav

# TTS output (requires minicpmo-utils)
python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
```

Interactive commands: `/image <path>` | `/audio <path>` | `/live` | `/clear` | `/quit`
### Streaming Mode (Full Duplex)
Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.
Use cases: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.
#### Architecture

```
[Screen Capture 1fps] ──┐
                        ├──> ChunkSynchronizer ──> Streaming Whisper ──> LLM (KV cache) ──> Text Output
[System Audio 16kHz] ───┘        │                       │                   │                  │
                            MelProcessor         Whisper KV cache       LLM KV cache            ▼
                                                                                    TTS Playback (optional)
```
#### Quick Start

```bash
# Full duplex streaming (captures primary monitor + system audio)
python chat_minicpmo.py --live

# Capture specific screen region
python chat_minicpmo.py --live --capture-region 0,0,1920,1080

# Use mic instead of system audio
python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"

# With TTS output (speaks responses aloud)
python chat_minicpmo.py --live --tts

# Or start from interactive mode
python chat_minicpmo.py
> /live
```

Press Ctrl+C to stop streaming.
#### CLI Options

| Flag | Default | Description |
|---|---|---|
| `--live` | — | Enable full duplex streaming mode |
| `--capture-region` | Primary monitor | Screen region as `x,y,w,h` |
| `--audio-device` | BlackHole | Audio input device name |
| `--tts` | Off | Enable TTS speech output |
| `--temp` | 0.0 | Sampling temperature |
| `--max-tokens` | 512 | Max tokens per chunk response |
#### How It Works

1. Screen capture (`mss`): Grabs a screenshot at 1 fps, resizes it to 448x448, and feeds it through the SigLIP2 vision encoder + Perceiver Resampler (64 tokens).
2. Audio capture (`sounddevice`): Records system audio via the BlackHole virtual device at 16kHz and accumulates 1-second chunks.
3. Streaming Whisper encoder: Processes audio incrementally using a KV cache — no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries. Auto-resets when reaching 1500 positions.
4. LLM with KV cache continuation: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.
5. Text generation: When the model has something to say, it generates text autoregressively from the cached state. Stops at `<|im_end|>` or mode-switch tokens.
6. TTS playback (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played through the speakers using Token2wav.
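The chunk-pairing step above can be sketched as follows. `ChunkSynchronizer` here is an illustrative stand-in for the class in `streaming.py`; its method names and buffer sizes are assumptions, not the actual implementation:

```python
import numpy as np

class ChunkSynchronizer:
    """Pairs each 1-second audio chunk with the most recent screen frame.

    Hypothetical sketch; the real class lives in streaming.py.
    """

    def __init__(self, sample_rate=16000, chunk_seconds=1.0):
        self.chunk_len = int(sample_rate * chunk_seconds)
        self.audio_buf = np.zeros(0, dtype=np.float32)
        self.latest_frame = None

    def push_frame(self, frame):
        # Keep only the newest screenshot; frames older than the
        # current audio chunk are dropped.
        self.latest_frame = frame

    def push_audio(self, samples):
        self.audio_buf = np.concatenate([self.audio_buf, samples])

    def pop_chunk(self):
        """Return (frame, 1 s of audio) once a full chunk has accumulated."""
        if self.audio_buf.size < self.chunk_len:
            return None
        chunk = self.audio_buf[:self.chunk_len]
        self.audio_buf = self.audio_buf[self.chunk_len:]
        return self.latest_frame, chunk

sync = ChunkSynchronizer()
sync.push_frame("frame-0")
sync.push_audio(np.zeros(16000, dtype=np.float32))
frame, audio = sync.pop_chunk()
print(frame, audio.shape)  # frame-0 (16000,)
```

Each popped pair is what gets encoded and prefilled into the LLM cache in step 4.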
#### Output Format

```
[1] The video shows a person speaking in Indonesian about cooking techniques.
    >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB
[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
    >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
```
#### System Audio Setup (macOS)

To capture system audio (what's playing through your speakers), you need BlackHole:

1. Install: `brew install blackhole-2ch`
2. Open Audio MIDI Setup (Spotlight > "Audio MIDI Setup")
3. Click + > Create Multi-Output Device
4. Check both MacBook Pro Speakers and BlackHole 2ch
5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
6. Run streaming with the default `--audio-device BlackHole`

Without BlackHole, use your mic: `--audio-device "MacBook Pro Microphone"`
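Device matching by name substring could look like the sketch below. With the real `sounddevice` package the list would come from `sd.query_devices()`; here a static list stands in so the logic is self-contained, and `pick_input_device` is a hypothetical helper, not part of the shipped scripts:

```python
def pick_input_device(devices, preferred="BlackHole"):
    """Return the first input device whose name contains `preferred`,
    falling back to the first device with input channels."""
    inputs = [d for d in devices if d["max_input_channels"] > 0]
    for d in inputs:
        if preferred.lower() in d["name"].lower():
            return d["name"]
    return inputs[0]["name"] if inputs else None

# Stand-in for sounddevice.query_devices() output
devices = [
    {"name": "MacBook Pro Speakers", "max_input_channels": 0},
    {"name": "MacBook Pro Microphone", "max_input_channels": 1},
    {"name": "BlackHole 2ch", "max_input_channels": 2},
]
print(pick_input_device(devices))            # BlackHole 2ch
print(pick_input_device(devices, "Micro"))   # MacBook Pro Microphone
```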
#### Memory & Latency Budget

| Component | Memory | Latency |
|---|---|---|
| Model weights | ~7.0 GB | — |
| LLM KV cache (4096 tok) | ~1.2 GB | — |
| Whisper KV cache (1500 pos) | ~0.3 GB | — |
| Screen capture | — | ~10ms |
| Mel extraction | — | ~50ms |
| Whisper streaming encode | — | ~200ms |
| Vision encode | — | ~150ms |
| LLM prefill (chunk) | — | ~300ms |
| LLM generate (50 tok) | — | ~1s |
| Total peak | ~9.0 GB | ~2.2s/chunk |
#### Files

| File | Description |
|---|---|
| `streaming.py` | ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback |
| `chat_minicpmo.py` | CLI with `--live` flag and `/live` interactive command |
## Python API

```python
from mlx_vlm import load
from mlx_vlm.generate import generate_step
import mlx.core as mx

model, processor = load("andrevp/MiniCPM-o-4_5-MLX-4bit", trust_remote_code=True)

# Text-only
text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])

tokens = []
for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
    tok_val = int(token)
    tokens.append(tok_val)
    if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
        break

print(processor.tokenizer.decode(tokens, skip_special_tokens=True))
```
### Audio Input (Python API)

```python
import soundfile as sf
import numpy as np
import mlx.core as mx
from transformers import WhisperFeatureExtractor

# Load and preprocess audio
audio, sr = sf.read("speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo to mono

# Extract mel spectrogram
fe = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000, n_fft=400, hop_length=160)
inputs = fe(audio, sampling_rate=16000, return_tensors="pt", padding="max_length", return_attention_mask=True)
mel = inputs["input_features"]
actual_len = inputs["attention_mask"].sum(dim=1)
mel_trimmed = mel[:, :, :int(actual_len[0])]

# Convert to MLX and run through the audio encoder
audio_features = mx.array(mel_trimmed.numpy())  # (1, 80, frames)

# Pass audio_features and audio_bound to generate_step via kwargs
# See chat_minicpmo.py for the full pipeline
```
## Component Details

### Audio Encoder (Whisper)
- 24-layer Whisper encoder (1024d, 16 heads, 4096 FFN)
- Conv1d feature extraction: mel (80 bins) -> conv1 (stride=1) -> conv2 (stride=2)
- Learned positional embeddings (max 1500 positions)
- Audio projection: 2-layer MLP (1024 -> 4096) with ReLU
- Average pooling with stride 5
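The downsampling chain above fixes how many audio tokens the LLM sees per second of audio. A quick back-of-envelope check (hop_length=160 at 16 kHz, stride-2 conv, stride-5 average pooling):

```python
def audio_tokens(seconds, sr=16000, hop=160, conv_stride=2, pool_stride=5):
    """Approximate LLM audio tokens for a clip of the given duration."""
    mel_frames = seconds * sr // hop        # 100 mel frames per second
    enc_frames = mel_frames // conv_stride  # Whisper encoder positions
    return enc_frames // pool_stride        # tokens after AvgPool(5)

print(audio_tokens(1))   # 10  tokens per second of audio
print(audio_tokens(30))  # 300 tokens; 30 s fills the 1500 encoder positions
```

This is consistent with the 1500-position limit: 30 seconds of audio produces exactly 1500 encoder positions, which is where the streaming encoder auto-resets.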
### TTS Model (CosyVoice2 Llama)
- 20-layer Llama backbone (768d, 12 heads, 3072 FFN)
- Text embedding: 152064 tokens -> 768d
- Audio codebook: 6562 tokens (1 VQ codebook)
- Semantic projector: LLM hidden (4096d) -> TTS hidden (768d)
- Speaker projector: LLM hidden (4096d) -> speaker embedding (768d)
- Autoregressive generation with temperature + top-p sampling
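The semantic projector is a plain 4096-to-768 linear map from LLM hidden states into the TTS backbone's width. A shape-only sketch with illustrative random weights (not the real checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)

# LLM hidden states for a 12-token response: (batch, seq, 4096)
llm_hidden = rng.normal(size=(1, 12, 4096)).astype(np.float32)

# Stand-in for the semantic projector weights: 4096 -> 768
W_sem = (rng.normal(size=(4096, 768)) * 0.01).astype(np.float32)

# Projected states feed the 20-layer TTS Llama as its input sequence
tts_input = llm_hidden @ W_sem
print(tts_input.shape)  # (1, 12, 768)
```

The speaker projector has the same shape but maps a pooled speaker representation rather than the per-token sequence.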
### Audio Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|audio_start\|>` | 151697 | Start of audio placeholder |
| `<\|audio\|>` | 151698 | Audio token |
| `<\|audio_end\|>` | 151699 | End of audio placeholder |
| `<\|spk_bos\|>` | 151700 | Speaker embedding start |
| `<\|spk_eos\|>` | 151702 | Speaker embedding end |
| `<\|tts_bos\|>` | 151703 | TTS generation start |
| `<\|tts_eos\|>` | 151704 | TTS generation end |
## Quantization Details
| Component | Keys | Precision | Notes |
|---|---|---|---|
| Qwen3 LLM (36L) | 907 | 4-bit (group_size=64) | Main language model |
| SigLIP2 Vision (27L) | 437 | Full precision | Vision encoder |
| Perceiver Resampler | 17 | Full precision | Cross-attention resampler |
| Whisper Audio (24L) | 367 | Full precision | Audio encoder |
| Audio Projection | 4 | Full precision | 2-layer MLP |
| TTS Llama (20L) | 193 | Full precision | Speech synthesis backbone |
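Group-wise 4-bit quantization with group_size=64 means every group of 64 weights shares one scale and offset, with each weight stored as a 4-bit integer. An illustrative sketch of the idea (a generic affine scheme, not MLX's exact kernel):

```python
import numpy as np

def quantize_group(w):
    """Quantize one group of weights to 4-bit integers plus scale/offset."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15.0 or 1.0          # 4 bits -> 16 levels
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)   # one group of 64 weights
q, scale, lo = quantize_group(w)
err = np.abs(dequantize_group(q, scale, lo) - w).max()
print(q.max() <= 15, err <= scale / 2 + 1e-6)  # True True
```

The "6.031 effective bits" figure reflects the 4-bit integers plus the per-group scale/offset overhead averaged over the quantized and full-precision components.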
## Notes

- Audio input requires 16kHz mono WAV. Install `librosa` for automatic resampling from other sample rates.
- TTS output generates audio token IDs. Converting them to a waveform requires the Token2wav vocoder from `minicpmo-utils[all]`.
- The model processes one image per turn and one audio clip per turn.
- Quantization may slightly reduce output quality compared to the full-precision model.
## License
This model is released under the Apache-2.0 license, following the original openbmb/MiniCPM-o-4_5 license.
See the original license for full terms.
## Disclaimer

As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora; it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
## Credits

- Original model: OpenBMB — MiniCPM-o 4.5
- MLX framework: Apple ML Explore
- mlx-vlm: Prince Canuma