MiniCPM-o-4.5-W8A16

INT8 post-training quantization of openbmb/MiniCPM-o-4.5 — a compact omni model with vision (SigLIP2), audio (Whisper), and speech synthesis (CosyVoice2) built on a Qwen3-8B backbone. ~9 GB on disk. Runs on any 16 GB GPU.


At a Glance

Property Value
Base model openbmb/MiniCPM-o-4.5
Architecture Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant format compressed-tensors (native vLLM)
Quant method AutoRound W8A16 (RTN, datafree)
Quantized model.llm transformer layers
Kept BF16 vision encoder, audio encoder, TTS components
Disk size ~9 GB
Min GPU 1× RTX 3090 24GB

Memory Requirements

Configuration BF16 W8A16
Weights ~18 GB ~9 GB
Min GPU 1× A100 40GB 1× RTX 3090 24GB

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.

python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS

llama-server \
  --model MiniCPM-o-4.5-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W8A16 1 32k
vLLM v0.21.0 W8A16 8 32k
llama.cpp b9297 Q8_0 GGUF 1 32k
llama.cpp b9297 IQ4_XS GGUF 1 32k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


What's Quantized, What's Not

Component Precision Reason
model.llm.* transformer layers W8A16 INT8 Quantized
Vision encoder (SigLIP2) BF16 Excluded
Audio encoder (Whisper) BF16 Excluded
CosyVoice2 TTS BF16 Excluded
Embeddings, LM head, norms BF16 Standard practice

Quality Targets

Metric Target
KL divergence vs BF16 < 0.005
MMLU recovery ≥ 99.7%

vs. Other MiniCPM-o-4.5 Quants

This is the first compressed-tensors W8A16 checkpoint for MiniCPM-o-4.5. It halves VRAM usage while retaining native vLLM serving with audio and vision input.

Quant Method Size GPU Compatibility Notes
88plug W8A16 (this) compressed-tensors RTN W8A16 ~9 GB Any Ampere+ ≥16 GB First W8A16; native vLLM; LLM backbone quantized
Community GGUF Q4_K_M llama.cpp GGUF ~5 GB CPU / any GPU Vision via mmproj; no CosyVoice2 in mainline
Community GGUF Q8_0 llama.cpp GGUF ~9 GB Any GPU ≥10 GB Near-lossless; same TTS limitation
BF16 baseline None ~18 GB 1× A100 40GB Reference; requires high-VRAM GPU

Limitations

  • LLM backbone only: Only model.llm transformer layers are quantized. Vision encoder (SigLIP2), audio encoder (Whisper), and CosyVoice2 TTS components stay BF16.
  • No CosyVoice2 in mainline vLLM: Speech output is not supported by mainline vLLM. Use the tc-mb/llama.cpp-omni fork for speech synthesis.
  • RTN (data-free) quantization: No calibration corpus used for the LLM backbone. Near-lossless at W8A16 but not AutoRound-calibrated.
  • Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: MiniCPM-o-4.5-W4A16 (INT4, ~4–5 GB) · MiniCPM-o-4.5-W8A16 (INT8, ~9 GB)

Browse all releases → huggingface.co/88plug

Downloads last month
-
Safetensors
Model size
9B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Evaluation results