
Libraries

How to use 88plug/MiniCPM-o-4.5-W4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="88plug/MiniCPM-o-4.5-W4A16", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("88plug/MiniCPM-o-4.5-W4A16", trust_remote_code=True, dtype="auto")


vLLM

How to use 88plug/MiniCPM-o-4.5-W4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/MiniCPM-o-4.5-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/MiniCPM-o-4.5-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/88plug/MiniCPM-o-4.5-W4A16

SGLang

How to use 88plug/MiniCPM-o-4.5-W4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/MiniCPM-o-4.5-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/MiniCPM-o-4.5-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/MiniCPM-o-4.5-W4A16" \
        --host 0.0.0.0 \
        --port 30000


MiniCPM-o-4.5-W4A16

INT4 post-training quantization of openbmb/MiniCPM-o-4.5, a compact omni model with vision (SigLIP2), audio (Whisper), and speech synthesis (CosyVoice2) built on a Qwen3-8B backbone. ~4–5 GB on disk. Runs on a single 10 GB GPU.

At a Glance

Property	Value
Base model	`openbmb/MiniCPM-o-4.5`
Release tier	Provisional (data-free RTN; re-quant scheduled)
Quant method	data-free RTN W4A16 (AutoRound blocked; see KNOWN-FAILURES)
FLAC status	Not measured (T+7d milestone)
Architecture	Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant format	compressed-tensors (native vLLM)
Scheme	W4A16
Group size	default (128)
Quantized	`model.llm` transformer Linear layers (Qwen3-8B backbone)
Kept BF16	Vision encoder (SigLIP2), audio encoder (Whisper), TTS (CosyVoice2), embeddings, LM head, norms
Disk size	~4–5 GB
Min GPU	1× RTX 3080 10 GB

Memory Requirements

Configuration	BF16	W8A16	W4A16
Weights	~18 GB	~9 GB	~4–5 GB
Min GPU	1× A100 40 GB	1× RTX 3090 24 GB	1× RTX 3080 10 GB

Note: The non-quantized modal encoders (SigLIP2 ~1 GB, Whisper ~390 MB, CosyVoice2 ~100 MB) are included in all footprint estimates above. Only the Qwen3-8B LLM backbone is quantized to 4-bit.
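
The backbone figures in the table follow from simple arithmetic on bits per weight. The sketch below is a rough estimate only: the 8e9 parameter count is a round-number assumption, and it assumes one 16-bit scale per group of 128 weights. The BF16 encoders, embeddings, and LM head come on top of the result.

```python
from typing import Optional


def quantized_weight_bytes(n_params: float, weight_bits: int,
                           group_size: Optional[int] = None,
                           scale_bits: int = 16) -> float:
    """Approximate storage for n_params weights at weight_bits each.

    If group_size is given, one scale of scale_bits is added per group.
    """
    total_bits = n_params * weight_bits
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


# Assumption: a round figure for the Linear weights in the LLM backbone.
BACKBONE_LINEAR_PARAMS = 8e9

bf16_gb = quantized_weight_bytes(BACKBONE_LINEAR_PARAMS, 16) / 1e9
w4_gb = quantized_weight_bytes(BACKBONE_LINEAR_PARAMS, 4, group_size=128) / 1e9
print(f"BF16 backbone: ~{bf16_gb:.1f} GB, W4A16 backbone: ~{w4_gb:.2f} GB")
```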

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output

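# Serve on port 8080 (vLLM defaults to 8000) so the -p mapping and the client below agree.
# --trust-remote-code is needed because the model ships custom code.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W4A16 \
  --trust-remote-code \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90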

Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

Python client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="token")

response = client.chat.completions.create(
    model="88plug/MiniCPM-o-4.5-W4A16",
    messages=[
        {"role": "user", "content": "Describe the architecture of MiniCPM-o 4.5."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
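
The client above sends text only. Since the model also accepts images, a request with an image could be built as in the sketch below, using the OpenAI-style `image_url` content part. The helper name `build_image_chat_request` and the example URL are hypothetical.

```python
from typing import Any, Dict


def build_image_chat_request(model: str, prompt: str, image_url: str,
                             max_tokens: int = 512) -> Dict[str, Any]:
    """Build keyword arguments for an OpenAI-compatible chat completion."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }


request = build_image_chat_request(
    model="88plug/MiniCPM-o-4.5-W4A16",
    prompt="Describe this image in one sentence.",
    image_url="https://example.com/picture.jpg",  # placeholder URL
)
# With the OpenAI client: client.chat.completions.create(**request)
```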

Quantization Design

What is quantized

Only the Qwen3-8B LLM backbone (model.llm) is quantized. This provisional release applies data-free round-to-nearest (RTN) W4A16 to all Linear layers within model.llm, with no calibration data. The intended AutoRound recipe, which optimizes the rounding over 200 calibration iterations per block, is currently blocked (see KNOWN-FAILURES).
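
To illustrate what data-free RTN with a group size means, here is a minimal pure-Python sketch of symmetric 4-bit round-to-nearest over groups of weights. It assumes a symmetric signed range and one scale per group. It is not the compressed-tensors implementation, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = -8, 7  # signed 4-bit integer range


def rtn_quantize_group(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest for one group; returns (codes, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def rtn_quantize(weights: Sequence[float],
                 group_size: int = 128) -> List[Tuple[List[int], float]]:
    """Quantize a flat weight list group by group, one scale per group."""
    return [rtn_quantize_group(weights[start:start + group_size])
            for start in range(0, len(weights), group_size)]


def dequantize(groups: List[Tuple[List[int], float]]) -> List[float]:
    """Reconstruct approximate weights from integer codes and scales."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [0.7, -0.35, 0.1, 0.0, 0.22, -0.61]   # made-up values
groups = rtn_quantize(weights, group_size=3)     # tiny group for illustration
restored = dequantize(groups)
```

Within each group the reconstruction error per weight is at most half the group's scale, which is why smaller groups track outliers better at the cost of storing more scales.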

What stays BF16

Component	Module path	Precision	Reason
Vision encoder	`vision_model.*`	BF16	Excluded from recipe
Audio encoder	`audio_model.*`	BF16	Excluded from recipe
CosyVoice2 TTS decoder	`tts.*`	BF16	Excluded from recipe
Embedding layers	`re:.*embed_tokens$`	BF16	Standard practice (ignore list)
Layer norms	`re:.*norm$`	BF16	Standard practice (ignore list)
LM head	`lm_head`	BF16	Standard practice (ignore list)

The full MiniCPM-o-4.5 checkpoint is saved via model.save_pretrained() after in-place quantization of model.llm, so the output contains all modalities — vision, audio, and TTS encoders remain at full BF16 fidelity.
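
The ignore-list entries in the table mix an exact name (`lm_head`) with `re:`-prefixed regular expressions. A simplified sketch of how such a list could be evaluated is shown below. The matching rule (exact name, or regex matched from the start of the name) is an assumption for illustration, and the module names are hypothetical paths relative to the LLM backbone.

```python
import re
from typing import Iterable

IGNORE = ["lm_head", "re:.*embed_tokens$", "re:.*norm$"]


def is_ignored(module_name: str, ignore: Iterable[str] = IGNORE) -> bool:
    """Return True if module_name matches any ignore entry."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Entries prefixed with "re:" are treated as regular expressions.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


# Hypothetical module names, relative to the LLM backbone.
names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.input_layernorm",
    "model.norm",
    "lm_head",
]
quantized = [name for name in names if not is_ignored(name)]
print(quantized)
```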

Implementation notes

MiniCPM-o-4.5 required four patches to run cleanly through llmcompressor:

get_imports patch: filters minicpmo, librosa, and soundfile imports to avoid the librosa→soxr cascade during quantization.
MiniCPMTTSConfig.__getattr__ patch: backfills top_p, top_k, and related attributes missing from the shipped config.json.
_move_missing_keys_from_meta_to_device wrap: handles all_tied_weights_keys not being set by MiniCPMO's remote code under transformers 5.8.1.
is_mllm_model=False override: forces AutoRound (currently blocked; see KNOWN-FAILURES) through the standard LLM path instead of the multimodal (MLLM) compressor, which would fail when it tries to load a processor from model.llm directly.

Additionally, torch.nn.Module.apply and torch.nn.Module.train are replaced with iterative equivalents to avoid stack overflow on MiniCPM-o's ~985-deep module tree.
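
The idea behind replacing a recursive tree walk with an iterative one can be sketched without torch. The `Node` class below is a hypothetical stand-in for a module, and `apply_iterative` is an illustration of the general technique rather than the patch used for this release. It calls the function on descendants before their ancestors, using an explicit stack.

```python
from typing import Callable, List


class Node:
    """Stand-in for a module: holds child nodes and a flag a callback can set."""

    def __init__(self) -> None:
        self.children: List["Node"] = []
        self.visited = False


def apply_iterative(root: Node, fn: Callable[[Node], None]) -> Node:
    """Call fn on every node, descendants before ancestors, without recursion."""
    order: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)            # pre-order: parent recorded before children
        stack.extend(node.children)
    for node in reversed(order):      # reversed pre-order: children before parent
        fn(node)
    return root


def mark(node: Node) -> None:
    node.visited = True


# Build a chain far deeper than Python's default recursion limit of 1000.
depth = 5000
root = Node()
tail = root
for _ in range(depth):
    child = Node()
    tail.children.append(child)
    tail = child

apply_iterative(root, mark)
```

A recursive walk over the same chain would raise RecursionError under the default limit, while the explicit stack only grows on the heap.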

Quality Targets

Metric	Target
KL divergence vs BF16	< 0.014
MMLU recovery	≥ 99%
RULER@128k	≥ 97%
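
For reference, the KL divergence target can be read as the divergence between the BF16 model's next-token distribution and the quantized model's, averaged over tokens. The sketch below computes it for a single position. The direction KL(BF16 || quantized) is an assumption, and the logits are made-up illustrative numbers, not measurements.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats; terms with p == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative logits for one token position (not measured values).
ref_logits = [2.0, 1.0, 0.1]     # stands in for the BF16 model
quant_logits = [1.9, 1.1, 0.1]   # stands in for the quantized model
kl = kl_divergence(softmax(ref_logits), softmax(quant_logits))
print(f"KL for this position: {kl:.6f} nats")
```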

Competitor Comparables

MiniCPM-o-4.5 is an omni model — meaningful comparisons must also support vision + audio input. As of publication, no other compressed-tensors or vLLM-native quantization of this model exists.

Model	Source	Format	Compare angle
`openbmb/MiniCPM-o-4.5`	official	BF16	Quality ceiling
`88plug/MiniCPM-o-4.5-W8A16`	88plug	compressed-tensors W8A16	Higher-precision sibling
`88plug/MiniCPM-o-4.5-W4A16`	88plug	compressed-tensors W4A16	This model

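First-to-market claim: No compressed-tensors or vLLM-native W4A16 quant was found for this model at publication time, which makes this the only known W4A16 quant of it for direct vLLM serving. It remains a Provisional-tier release, and its quality and throughput have not yet been measured.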

Benchmarks

Results pending.

Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W4A16 compressed-tensors	1	32k	—	—	—	—
vLLM v0.21.0	W4A16 compressed-tensors	8	32k	—	—	—	—
vLLM v0.21.0	W4A16 compressed-tensors	1	128k	—	—	—	—
SGLang v0.5.8	BF16 (baseline)	1	32k	—	—	—	—
llama.cpp b9297	Q4_K_M GGUF	1	32k	—	—	—	—
llama.cpp b9297	IQ4_XS GGUF	1	32k	—	—	—	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
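
Once latency samples are collected, the TTFT p50 and p99 columns could be filled with a nearest-rank percentile, as sketched below. The helper names are hypothetical and the sample values are placeholders, not benchmark results.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Decode throughput for one request."""
    return generated_tokens / elapsed_seconds


# Placeholder time-to-first-token samples in milliseconds (not measurements).
ttft_ms = [30.0, 10.0, 20.0]
p50 = percentile_nearest_rank(ttft_ms, 50)
p99 = percentile_nearest_rank(ttft_ms, 99)
```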

SGLang

SGLang does not natively support compressed-tensors. To run MiniCPM-o-4.5 under SGLang, serve the BF16 base (openbmb/MiniCPM-o-4.5) or an AWQ variant instead of this checkpoint.

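# Bind to 0.0.0.0 so the published port is reachable from outside the container.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path openbmb/MiniCPM-o-4.5 \
  --trust-remote-code \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000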

SGLang results are BF16 baseline — useful as a throughput ceiling reference, not a direct quality comparison to this quant.

llama.cpp

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert and quantize from the BF16 base — do not convert from compressed-tensors weights.

python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outfile MiniCPM-o-4.5-BF16.gguf

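llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q4_K_M.gguf Q4_K_M

# --imatrix takes an importance-matrix file, not raw calibration text,
# so generate it from the calibration file first.
llama-imatrix -m MiniCPM-o-4.5-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS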

llama-server \
  --model MiniCPM-o-4.5-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Technical Details

Parameter	Value
Quantizer	Data-free RTN for this release; AutoRound (via llmcompressor `AutoRoundModifier`) is blocked, see KNOWN-FAILURES
Targets	`["Linear"]` within `model.llm`
Scheme	`W4A16`
Pipeline	`sequential`
Max seq length	2048
Ignore list	`lm_head`, `re:.*embed_tokens$`, `re:.*norm$`
Activations	16-bit floating point (unquantized; the A16 in W4A16)
trust_remote_code	required

Citation

@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug
