MiniCPM-o 4.5 – MLX 4-bit Quantized (Full Multimodal)

4-bit quantized MLX conversion of openbmb/MiniCPM-o-4_5 for fast inference on Apple Silicon (M1/M2/M3/M4).

Includes all modalities: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and full duplex streaming (real-time screen + audio capture).

Model Details

Base model    openbmb/MiniCPM-o-4_5
Architecture  SigLIP2 (27L) + Perceiver Resampler + Whisper Encoder (24L) + Qwen3 LLM (36L) + TTS Llama (20L)
Parameters    ~8B
Quantization  4-bit (6.031 effective bits); LLM quantized, all encoders full precision
Size on disk  ~7.0 GB
Weight keys   1925 total (LLM: 907, Vision: 437, Resampler: 17, Audio: 367, Audio Proj: 4, TTS: 193)
Framework     MLX via mlx-vlm

Architecture

Audio (.wav) --> Mel Spectrogram --> WhisperEncoder (24L, 1024d) --> AudioProjection --> AvgPool(5) --> audio tokens
Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) --> image tokens (64)
Text --> Tokenizer --> text tokens

audio + image + text tokens --> Qwen3 LLM (36L) --> Text Output
                                      |
                                      +--> LLM hidden states --> TTSProjector --> TTS Llama (20L) --> Audio Tokens

Performance (M4 Pro, 24 GB RAM)

Mode          Prompt Processing  Generation  Peak Memory
Text-only     ~60 tok/s          ~55 tok/s   ~7.1 GB
Image + Text  ~150 tok/s         ~49 tok/s   ~8.3 GB
Audio + Text  ~85 tok/s          ~55 tok/s   ~8.4 GB

Capabilities

  • Vision: Image understanding, OCR, chart/diagram analysis, math solving, visual reasoning
  • Audio input: Speech recognition, audio description, sound classification
  • TTS output: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
  • Multilingual: English, Chinese, Indonesian, French, German, etc.
  • Full duplex streaming: Real-time screen capture + system audio analysis with continuous LLM output

Requirements

  • Apple Silicon Mac (M1 or later)
  • Python 3.10+
  • ~10 GB free RAM (for full multimodal)

Install the core dependencies:

pip install mlx-vlm torch transformers Pillow soundfile

Optional dependencies:

pip install librosa                # Audio resampling (if input isn't 16kHz)
pip install minicpmo-utils[all]    # Token2wav vocoder for TTS output
pip install mss sounddevice        # For streaming mode (screen + audio capture)

For system audio capture on macOS (streaming mode):

brew install blackhole-2ch

Then open Audio MIDI Setup > create a Multi-Output Device combining your speakers + BlackHole 2ch.

Quick Start

Chat Script

A standalone chat_minicpmo.py script is included:

# Image input
python chat_minicpmo.py photo.jpg -p "What's in this image?"

# Audio input
python chat_minicpmo.py --audio speech.wav -p "What is being said?"

# Audio description
python chat_minicpmo.py --audio sound.wav -p "Describe this audio."

# Text-only
python chat_minicpmo.py -p "Explain quantum computing briefly."

# Interactive mode
python chat_minicpmo.py

# Interactive with pre-loaded audio
python chat_minicpmo.py --audio recording.wav

# TTS output (requires minicpmo-utils)
python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav

Interactive commands: /image <path> | /audio <path> | /live | /clear | /quit

Streaming Mode (Full Duplex)

Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.

Use cases: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.

Architecture

[Screen Capture 1fps] ---+
                         +--> ChunkSynchronizer --> Streaming Whisper --> LLM (KV cache) --> Text Output
[System Audio 16kHz] ----+          ^                      ^                    ^                 |
                              MelProcessor          Whisper KV cache       LLM KV cache           v
                                                                                        TTS Playback (optional)
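The ChunkSynchronizer pairs each one-second audio chunk with the most recent screen frame. A minimal sketch of that pairing logic (class and method names here are illustrative; the shipped streaming.py is authoritative):

```python
import queue
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Chunk:
    frame: Optional[Any]   # latest screenshot, or None if none arrived yet
    audio: Any             # one second of 16kHz samples

class ChunkSynchronizerSketch:
    """Pairs the newest video frame with each 1-second audio chunk."""

    def __init__(self) -> None:
        self._frames: queue.Queue = queue.Queue()
        self._audio: queue.Queue = queue.Queue()

    def push_frame(self, frame: Any) -> None:
        self._frames.put(frame)

    def push_audio(self, samples: Any) -> None:
        self._audio.put(samples)

    def next_chunk(self, timeout: float = 2.0) -> Chunk:
        samples = self._audio.get(timeout=timeout)  # audio is the clock
        frame = None
        while not self._frames.empty():             # drain queue, keep newest
            frame = self._frames.get_nowait()
        return Chunk(frame=frame, audio=samples)
```

Audio drives the clock, so a slow screen grab delays nothing: the newest frame simply wins.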

Quick Start

# Full duplex streaming (captures primary monitor + system audio)
python chat_minicpmo.py --live

# Capture specific screen region
python chat_minicpmo.py --live --capture-region 0,0,1920,1080

# Use mic instead of system audio
python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"

# With TTS output (speaks responses aloud)
python chat_minicpmo.py --live --tts

# Or start from interactive mode
python chat_minicpmo.py
> /live

Press Ctrl+C to stop streaming.

CLI Options

Flag              Default          Description
--live            (off)            Enable full duplex streaming mode
--capture-region  Primary monitor  Screen region as x,y,w,h
--audio-device    BlackHole        Audio input device name
--tts             Off              Enable TTS speech output
--temp            0.0              Sampling temperature
--max-tokens      512              Max tokens per chunk response

How It Works

  1. Screen capture (mss): Grabs a screenshot at 1 fps, resizes to 448x448, feeds through SigLIP2 vision encoder + Perceiver Resampler (64 tokens).

  2. Audio capture (sounddevice): Records system audio via BlackHole virtual device at 16kHz. Accumulates 1-second chunks.

  3. Streaming Whisper encoder: Processes audio incrementally using KV cache โ€” no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries. Auto-resets when reaching 1500 positions.

  4. LLM with KV cache continuation: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.

  5. Text generation: When the model has something to say, it generates text autoregressively from the cached state. Stops at <|im_end|> or mode-switch tokens.

  6. TTS playback (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through speakers using Token2wav.
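The per-chunk control flow in the steps above can be condensed into a small loop. Here prefill and generate stand in for the real cache-manipulating methods (hypothetical signatures; DuplexGenerator in streaming.py owns the actual MLX KV caches):

```python
def duplex_loop(chunks, prefill, generate, max_tokens=512):
    """Sketch of the per-chunk duplex loop.

    prefill(chunk): extend the running LLM KV cache with the chunk's
        vision + audio embeddings; returns "listen" or "speak".
    generate(n): continue autoregressive generation from the cached state.
    """
    outputs = []
    for chunk in chunks:
        mode = prefill(chunk)        # step 4: KV cache continuation
        if mode == "speak":          # model decided it has something to say
            outputs.append(generate(max_tokens))  # step 5: text generation
    return outputs
```

Because the cache persists across iterations, each chunk costs only its own prefill plus any generation, never a full re-encode of history.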

Output Format

[1] The video shows a person speaking in Indonesian about cooking techniques.
  >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB
[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
  >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
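The >> status line is plain key=value pairs, so downstream tooling can parse it in a few lines (a sketch, assuming the format shown above):

```python
def parse_status(line: str) -> dict:
    """Parse a status line like '>> chunk=2 mode=listen cache=284tok ...'."""
    fields = line.lstrip("> ").split()          # drop the '>>' prefix, split on spaces
    return dict(f.split("=", 1) for f in fields)

s = parse_status(">> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB")
print(s["mode"], s["latency"])   # listen 2100ms
```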

System Audio Setup (macOS)

To capture system audio (what's playing through your speakers), you need BlackHole:

  1. Install: brew install blackhole-2ch
  2. Open Audio MIDI Setup (Spotlight > "Audio MIDI Setup")
  3. Click + > Create Multi-Output Device
  4. Check both MacBook Pro Speakers and BlackHole 2ch
  5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
  6. Run streaming with default --audio-device BlackHole

Without BlackHole, use your mic: --audio-device "MacBook Pro Microphone"

Memory & Latency Budget

Component                    Memory    Latency
Model weights                ~7.0 GB   -
LLM KV cache (4096 tok)      ~1.2 GB   -
Whisper KV cache (1500 pos)  ~0.3 GB   -
Screen capture               -         ~10ms
Mel extraction               -         ~50ms
Whisper streaming encode     -         ~200ms
Vision encode                -         ~150ms
LLM prefill (chunk)          -         ~300ms
LLM generate (50 tok)        -         ~1s
Total peak                   ~9.0 GB   ~2.2s/chunk

Files

File              Description
streaming.py      ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback
chat_minicpmo.py  CLI with --live flag and /live interactive command

Python API

from mlx_vlm import load
from mlx_vlm.generate import generate_step
import mlx.core as mx

model, processor = load("andrevp/MiniCPM-o-4_5-MLX-4bit", trust_remote_code=True)

# Text-only
text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])

tokens = []
for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
    tok_val = int(token)
    if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
        break                      # stop token reached
    tokens.append(tok_val)
    if len(tokens) >= 256:         # generate_step yields indefinitely; cap it
        break

print(processor.tokenizer.decode(tokens, skip_special_tokens=True))

Audio Input (Python API)

import soundfile as sf
import mlx.core as mx
from transformers import WhisperFeatureExtractor

# Load and preprocess audio
audio, sr = sf.read("speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo to mono

# Extract mel spectrogram
fe = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000, n_fft=400, hop_length=160)
inputs = fe(audio, sampling_rate=16000, return_tensors="pt", padding="max_length", return_attention_mask=True)
mel = inputs["input_features"]
actual_len = inputs["attention_mask"].sum(dim=1)
mel_trimmed = mel[:, :, :int(actual_len[0])]

# Convert to MLX and run through audio encoder
audio_features = mx.array(mel_trimmed.numpy())  # (1, 80, frames)

# Pass audio_features and audio_bound to generate_step via kwargs
# See chat_minicpmo.py for the full pipeline

Component Details

Audio Encoder (Whisper)

  • 24-layer Whisper encoder (1024d, 16 heads, 4096 FFN)
  • Conv1d feature extraction: mel (80 bins) -> conv1 (stride=1) -> conv2 (stride=2)
  • Learned positional embeddings (max 1500 positions)
  • Audio projection: 2-layer MLP (1024 -> 4096) with ReLU
  • Average pooling with stride 5
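This downsampling chain fixes the audio token rate. A quick back-of-the-envelope check, assuming the standard Whisper frontend (16kHz input, hop length 160, hence 100 mel frames per second):

```python
def audio_tokens_per_second(sample_rate=16_000, hop_length=160,
                            conv_stride=2, pool_stride=5):
    """Mel frames/s -> encoder positions/s -> pooled LLM tokens/s."""
    mel_frames = sample_rate // hop_length   # 100 mel frames per second
    positions = mel_frames // conv_stride    # conv2 stride 2 -> 50 positions/s
    return positions // pool_stride          # AvgPool(5) -> 10 LLM tokens/s

print(audio_tokens_per_second())   # 10
```

At 50 encoder positions per second, the 1500-position limit corresponds to roughly 30 seconds of audio before the streaming encoder resets.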

TTS Model (CosyVoice2 Llama)

  • 20-layer Llama backbone (768d, 12 heads, 3072 FFN)
  • Text embedding: 152064 tokens -> 768d
  • Audio codebook: 6562 tokens (1 VQ codebook)
  • Semantic projector: LLM hidden (4096d) -> TTS hidden (768d)
  • Speaker projector: LLM hidden (4096d) -> speaker embedding (768d)
  • Autoregressive generation with temperature + top-p sampling

Audio Special Tokens

Token            ID      Purpose
<|audio_start|>  151697  Start of audio placeholder
<|audio|>        151698  Audio token
<|audio_end|>    151699  End of audio placeholder
<|spk_bos|>      151700  Speaker embedding start
<|spk_eos|>      151702  Speaker embedding end
<|tts_bos|>      151703  TTS generation start
<|tts_eos|>      151704  TTS generation end
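These tokens suggest how an audio placeholder is laid out in the prompt. The sketch below is an illustration only: the processor's chat template is authoritative, and the 10 tokens-per-second rate is an assumption based on the AvgPool(5) stride:

```python
AUDIO_START, AUDIO, AUDIO_END = "<|audio_start|>", "<|audio|>", "<|audio_end|>"

def audio_placeholder(seconds: float, tokens_per_second: int = 10) -> str:
    """Illustrative placeholder: one <|audio|> slot per pooled audio token."""
    n = int(seconds * tokens_per_second)
    return AUDIO_START + AUDIO * n + AUDIO_END

print(audio_placeholder(2.0).count("<|audio|>"))   # 20 slots for 2s of audio
```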

Quantization Details

Component Keys Precision Notes
Qwen3 LLM (36L) 907 4-bit (group_size=64) Main language model
SigLIP2 Vision (27L) 437 Full precision Vision encoder
Perceiver Resampler 17 Full precision Cross-attention resampler
Whisper Audio (24L) 367 Full precision Audio encoder
Audio Projection 4 Full precision 2-layer MLP
TTS Llama (20L) 193 Full precision Speech synthesis backbone
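A selective scheme like this is what mlx.nn.quantize's class_predicate hook is for. The sketch below shows a predicate that quantizes only LLM Linear layers; the module-path prefixes (vpm., apm., resampler., tts.) are assumptions about this checkpoint's naming, not verified keys:

```python
def is_llm_linear(path: str, module_type: str = "Linear") -> bool:
    """Return True only for Linear layers inside the Qwen3 LLM.

    Encoders, resampler, projections, and the TTS backbone stay in full
    precision. The prefixes below are assumed, not read from the checkpoint.
    """
    full_precision_prefixes = ("vpm.", "resampler.", "apm.",
                               "audio_projection", "tts.")
    if module_type != "Linear":
        return False
    return not path.startswith(full_precision_prefixes)

# Intended use (requires mlx on Apple Silicon):
# import mlx.nn as nn
# nn.quantize(model, group_size=64, bits=4,
#             class_predicate=lambda p, m: is_llm_linear(p, type(m).__name__))
```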

Notes

  • Audio input requires 16kHz mono WAV. Install librosa for automatic resampling from other sample rates.
  • TTS output generates audio token IDs. Converting to waveform requires the Token2wav vocoder from minicpmo-utils[all].
  • Processes one image per turn, one audio clip per turn.
  • Quantization may slightly reduce output quality compared to the full-precision model.

License

This model is released under the Apache-2.0 license, following the original openbmb/MiniCPM-o-4_5 license.

See the original license for full terms.

Disclaimer

As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora; it cannot comprehend or express personal opinions, and it does not make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks to public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
