MiniCPM-o-4.5-W4A16

INT4 post-training quantization of openbmb/MiniCPM-o-4.5, a compact omni model with vision (SigLIP2), audio (Whisper), and speech synthesis (CosyVoice2) built on a Qwen3-8B backbone. ~4–5 GB on disk. Runs on a single 8 GB GPU.


At a Glance

| Property | Value |
| --- | --- |
| Base model | openbmb/MiniCPM-o-4.5 |
| Architecture | Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS |
| Quant method | AutoRound (llmcompressor) |
| Quant format | compressed-tensors (native vLLM) |
| Scheme | W4A16 |
| Group size | default (128) |
| Calibration iters | 200 |
| Quantized | model.llm transformer Linear layers (Qwen3-8B backbone) |
| Kept BF16 | Vision encoder (SigLIP2), audio encoder (Whisper), TTS (CosyVoice2), embeddings, LM head, norms |
| Calibration data | 512× UltraChat + 512× Wikitext-103, seq 2048 |
| Disk size | ~4–5 GB |
| Min GPU | 1× RTX 3080 10 GB |

Memory Requirements

| Configuration | BF16 | W8A16 | W4A16 |
| --- | --- | --- | --- |
| Weights | ~18 GB | ~9 GB | ~4–5 GB |
| Min GPU | 1× A100 40 GB | 1× RTX 3090 24 GB | 1× RTX 3080 10 GB |

Note: The non-quantized modal encoders (SigLIP2 ~1 GB, Whisper ~390 MB, CosyVoice2 ~100 MB) are included in all footprint estimates above. Only the Qwen3-8B LLM backbone is quantized to 4-bit.


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format, so vLLM detects and loads the quantization automatically. No --quantization flag needed.
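
As a rough illustration of what that auto-detection relies on, the sketch below reads a checkpoint's config.json and reports the declared quantization method. The quant_method key with the value compressed-tensors follows the usual Hugging Face convention; the helper names and the trimmed sample config are hypothetical, and this is not vLLM's actual loading code.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(config: dict) -> Optional[str]:
    """Return the quantization method declared in a config dict, if any."""
    quant_cfg = config.get("quantization_config")
    if not isinstance(quant_cfg, dict):
        return None  # no quantization block: an unquantized checkpoint
    return quant_cfg.get("quant_method")


def detect_from_file(config_path: str) -> Optional[str]:
    """Same check, starting from a config.json on disk."""
    return detect_quant_method(json.loads(Path(config_path).read_text()))


# Hypothetical, heavily trimmed config used only for illustration.
sample = {"quantization_config": {"quant_method": "compressed-tensors"}}
print(detect_quant_method(sample))
```

Running detect_from_file on the downloaded checkpoint's config.json is a quick way to confirm which variant is on disk before serving it.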

vLLM: text output

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W4A16 \
  --port 8080 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.
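
To see why the --kv-cache-dtype fp8 flag matters on a small GPU, here is a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dimension are assumptions taken from the public Qwen3-8B configuration (36 layers, 8 KV heads, head dimension 128); check them against this checkpoint's config.json before relying on the numbers.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 36,     # assumed from Qwen3-8B
    num_kv_heads: int = 8,    # assumed from Qwen3-8B (grouped-query attention)
    head_dim: int = 128,      # assumed from Qwen3-8B
    bytes_per_elem: int = 1,  # 1 for FP8, 2 for BF16/FP16
) -> int:
    """Bytes of KV cache needed to hold num_tokens tokens."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


GIB = 1024 ** 3
fp8_gib = kv_cache_bytes(32768, bytes_per_elem=1) / GIB
bf16_gib = kv_cache_bytes(32768, bytes_per_elem=2) / GIB
print(f"32k context: fp8 {fp8_gib:.2f} GiB, bf16 {bf16_gib:.2f} GiB")
```

Under these assumptions a single 32k-token sequence needs about 2.25 GiB of KV cache at FP8 versus 4.5 GiB at BF16, which is a large share of a 10 GB card.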

Python client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="token")

response = client.chat.completions.create(
    model="88plug/MiniCPM-o-4.5-W4A16",
    messages=[
        {"role": "user", "content": "Describe the architecture of MiniCPM-o 4.5."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
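
Because the vision encoder ships with the checkpoint, image input may also work through the same endpoint. The sketch below builds a user message in the OpenAI multimodal chat format with a base64 data URL. Whether a given vLLM build accepts it for this model is an assumption to verify, and the helper name and placeholder bytes are hypothetical.

```python
import base64


def build_image_message(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build one OpenAI-style chat message carrying text plus an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes; in practice use open("photo.png", "rb").read().
message = build_image_message(b"placeholder-image-bytes", "What is in this image?")
print([part["type"] for part in message["content"]])
```

The resulting dict can be passed as messages=[message] to client.chat.completions.create in the client shown above.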

Quantization Design

What is quantized

Only the Qwen3-8B LLM backbone (model.llm) is quantized. AutoRound applies W4A16 to all Linear layers within model.llm. Rather than plain round-to-nearest, it tunes the weight rounding and clipping ranges with signed gradient descent, here for 200 calibration iterations per block.
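
To make the W4A16 weight format concrete, the sketch below applies plain round-to-nearest symmetric INT4 quantization with one scale per group of weights. This is the simple baseline that AutoRound's tuned rounding is meant to beat, not AutoRound itself. The symmetric [-8, 7] code range and the function names are assumptions for illustration, and a tiny group size is used to keep the example readable (the card lists 128).

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (INT4 codes, per-group scale)


def quantize_group(weights: List[float]) -> Group:
    """Round-to-nearest symmetric INT4 quantization of one group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(weights: List[float], group_size: int = 128) -> List[Group]:
    """Split a weight row into groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights: code * scale, group by group."""
    return [code * scale for codes, scale in groups for code in codes]


row = [0.12, -0.50, 0.33, 0.07, 1.40, -0.90, 0.02, 0.65]
groups = quantize_row(row, group_size=4)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Each group carries its own scale, so an outlier in one group (1.40 here) does not coarsen the resolution of the others. The error of any single weight is bounded by half of its group's scale.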

What stays BF16

| Component | Module path | Precision | Reason |
| --- | --- | --- | --- |
| Vision encoder | vision_model.* | BF16 | Excluded from recipe |
| Audio encoder | audio_model.* | BF16 | Excluded from recipe |
| CosyVoice2 TTS decoder | tts.* | BF16 | Excluded from recipe |
| Embedding layers | re:.*embed_tokens$ | BF16 | Standard practice (ignore list) |
| Layer norms | re:.*norm$ | BF16 | Standard practice (ignore list) |
| LM head | lm_head | BF16 | Standard practice (ignore list) |

The full MiniCPM-o-4.5 checkpoint is saved via model.save_pretrained() after in-place quantization of model.llm, so the output contains all modalities: the vision encoder, audio encoder, and TTS decoder remain at full BF16 fidelity.
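
The ignore-list entries in the table mix exact names with patterns prefixed by re:. A minimal sketch of how such a list can be evaluated is shown below, assuming that an re: entry is a regular expression matched from the start of the module name and that any other entry must match exactly. The module names are hypothetical examples, and this is not the library's own matching code.

```python
import re
from typing import Iterable


def is_ignored(module_name: str, ignore: Iterable[str]) -> bool:
    """Return True if module_name matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the prefix and match against the full name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


IGNORE = ["lm_head", "re:.*embed_tokens$", "re:.*norm$"]

# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.input_layernorm",
    "model.layers.0.self_attn.q_proj",
]:
    print(f"{name}: {'kept BF16' if is_ignored(name, IGNORE) else 'quantized'}")
```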

Implementation notes

MiniCPM-o-4.5 required four patches to run cleanly through llmcompressor:

  1. get_imports patch: filters minicpmo, librosa, and soundfile imports to avoid the librosa→soxr cascade during quantization.
  2. MiniCPMTTSConfig.__getattr__ patch: backfills top_p, top_k, and related attributes missing from the shipped config.json.
  3. _move_missing_keys_from_meta_to_device wrap: handles all_tied_weights_keys not being set by MiniCPMO's remote code under transformers 5.8.1.
  4. is_mllm_model=False override: forces AutoRound through the standard LLM path instead of the multimodal MLLM compressor, which would fail trying to load a processor from model.llm directly.
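
Patch 2 in the list above relies on a standard Python mechanism: __getattr__ is consulted only when normal attribute lookup fails, so it can supply defaults for fields absent from a loaded config. The sketch below shows the pattern on a stand-in class. The class name and the default values are hypothetical and are not the values used in the actual patch.

```python
class BackfilledConfig:
    """Stand-in config object that backfills attributes missing after loading."""

    # Hypothetical defaults, for illustration only.
    _FALLBACKS = {"top_p": 1.0, "top_k": 50}

    def __init__(self, **loaded):
        # Values that were present in the (hypothetical) config.json.
        self.__dict__.update(loaded)

    def __getattr__(self, name):
        # Reached only when `name` is not found by normal lookup.
        try:
            return type(self)._FALLBACKS[name]
        except KeyError:
            raise AttributeError(name) from None


cfg = BackfilledConfig(top_k=20)
print(cfg.top_k)  # value loaded from the config wins
print(cfg.top_p)  # missing value comes from the fallback table
```

Raising AttributeError for unknown names keeps hasattr and getattr with a default behaving normally for everything outside the fallback table.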

Additionally, torch.nn.Module.apply and torch.nn.Module.train are replaced with iterative equivalents to avoid stack overflow on MiniCPM-o's ~985-deep module tree.
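
The idea behind such iterative replacements can be sketched without PyTorch. The example below walks a module-like tree with an explicit stack and calls a function on every node, children before parents, which is the order torch.nn.Module.apply uses. Node is a hypothetical stand-in for torch.nn.Module, and this is an illustration of the technique rather than the patch that was applied.

```python
from typing import Callable, List, Optional


class Node:
    """Hypothetical stand-in for torch.nn.Module: children plus a training flag."""

    def __init__(self, children: Optional[List["Node"]] = None):
        self.children = list(children or [])
        self.training = True


def apply_iterative(root: Node, fn: Callable[[Node], None]) -> Node:
    """Call fn on every node, children before parents, without recursion."""
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if children_done:
            fn(node)
            continue
        # Revisit this node after all of its children have been processed.
        stack.append((node, True))
        # Push in reverse so children are processed in their original order.
        for child in reversed(node.children):
            stack.append((child, False))
    return root


# Build a chain deeper than Python's default recursion limit of 1000.
root = Node()
tip = root
for _ in range(2000):
    child = Node()
    tip.children.append(child)
    tip = child

count = 0


def set_eval(node: Node) -> None:
    global count
    node.training = False
    count += 1


apply_iterative(root, set_eval)
print(count, root.training, tip.training)
```

A recursive version of the same walk would exceed the interpreter's recursion limit on this chain, while the explicit stack only grows on the heap.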


Quality Targets

| Metric | Target |
| --- | --- |
| KL divergence vs BF16 | < 0.014 |
| MMLU recovery | ≥ 99% |
| RULER@128k | ≥ 97% |
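
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them: KL divergence between the BF16 and quantized next-token distributions at a single position, and recovery as the quantized score expressed as a percentage of the baseline. The direction KL(BF16 || quantized), the use of nats, and the sample logits are assumptions; the card does not specify how its targets are measured.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: List[float], quant_logits: List[float]) -> float:
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def recovery_percent(quant_score: float, baseline_score: float) -> float:
    """Quantized benchmark score as a percentage of the baseline score."""
    return 100.0 * quant_score / baseline_score


# Hypothetical logits over a four-token vocabulary.
bf16_logits = [2.0, 1.0, 0.5, -1.0]
w4a16_logits = [1.9, 1.1, 0.4, -0.9]
print(f"KL: {kl_divergence(bf16_logits, w4a16_logits):.5f} nats")
```

In practice the per-position value would be averaged over many tokens of held-out text before comparing against a threshold.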

Competitor Comparables

MiniCPM-o-4.5 is an omni model, so a meaningful comparison target must also support vision and audio input. As of publication, no other compressed-tensors or vLLM-native quantization of this model exists.

| Model | Source | Format | Compare angle |
| --- | --- | --- | --- |
| openbmb/MiniCPM-o-4.5 | official | BF16 | Quality ceiling |
| 88plug/MiniCPM-o-4.5-W8A16 | 88plug | compressed-tensors W8A16 | Higher-precision sibling |
| 88plug/MiniCPM-o-4.5-W4A16 | 88plug | compressed-tensors W4A16 | This model |

First-to-market claim: No compressed-tensors or vLLM-native W4A16 quant was found for this model at publication time. This is the only production-ready W4A16 quant for direct vLLM serving.


Benchmarks

Results pending.

| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM v0.21.0 | W4A16 compressed-tensors | 1 | 32k | - | - | - | - |
| vLLM v0.21.0 | W4A16 compressed-tensors | 8 | 32k | - | - | - | - |
| vLLM v0.21.0 | W4A16 compressed-tensors | 1 | 128k | - | - | - | - |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | Q4_K_M GGUF | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | - | - | - | - |

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
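
Until results are filled in, the sketch below shows how the latency and throughput columns could be derived from raw measurements. It uses the nearest-rank percentile definition, which is one of several in common use; the benchmark tooling behind this table is not specified, and the sample timings are hypothetical.

```python
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Decode throughput for one request."""
    return num_tokens / elapsed_s


# Hypothetical time-to-first-token samples in milliseconds.
ttft_ms = [41.0, 39.5, 44.2, 40.1, 120.7, 42.3, 38.9, 43.0, 40.8, 41.7]
print(f"TTFT p50: {percentile(ttft_ms, 50)} ms")
print(f"TTFT p99: {percentile(ttft_ms, 99)} ms")
```

Reporting p99 alongside p50 exposes occasional slow requests, such as the single outlier in the sample data, that a median alone would hide.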


SGLang

SGLang does not natively support compressed-tensors. To use this model with SGLang, serve the BF16 base (openbmb/MiniCPM-o-4.5) or an AWQ variant.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path openbmb/MiniCPM-o-4.5 \
  --trust-remote-code \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

SGLang results are a BF16 baseline: useful as a throughput ceiling reference, not a direct quality comparison to this quant.


llama.cpp

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert and quantize from the BF16 base; do not convert from compressed-tensors weights.

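# convert_hf_to_gguf.py expects a local model directory, so download the BF16 base to this path first.
python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outtype bf16 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q4_K_M.gguf Q4_K_M

# --imatrix takes an importance-matrix file, not raw text: build it from the calibration text first.
llama-imatrix -m MiniCPM-o-4.5-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS

llama-server \
  --model MiniCPM-o-4.5-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081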

Technical Details

| Parameter | Value |
| --- | --- |
| Quantizer | AutoRound (via llmcompressor AutoRoundModifier) |
| Targets | ["Linear"] within model.llm |
| Scheme | W4A16 |
| Calibration iters | 200 |
| Pipeline | sequential |
| Calibration samples | 1024 (512 UltraChat + 512 Wikitext-103) |
| Max seq length | 2048 |
| Ignore list | lm_head, re:.*embed_tokens$, re:.*norm$ |
| Activations | FP16 (unquantized; the A16 in W4A16) |
| trust_remote_code | required |

Citation

@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16: INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16: AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery, the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: MiniCPM-o-4.5-W8A16 (INT8, ~9 GB) · MiniCPM-o-4.5-W4A16 (INT4, ~4–5 GB)

Browse all releases: huggingface.co/88plug
