
Libraries

How to use 88plug/MiniCPM-o-4.5-W8A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="88plug/MiniCPM-o-4.5-W8A16", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("88plug/MiniCPM-o-4.5-W8A16", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use 88plug/MiniCPM-o-4.5-W8A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/MiniCPM-o-4.5-W8A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/MiniCPM-o-4.5-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/88plug/MiniCPM-o-4.5-W8A16

SGLang

How to use 88plug/MiniCPM-o-4.5-W8A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/MiniCPM-o-4.5-W8A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/MiniCPM-o-4.5-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/MiniCPM-o-4.5-W8A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/MiniCPM-o-4.5-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

MiniCPM-o-4.5-W8A16

INT8 post-training quantization of openbmb/MiniCPM-o-4.5, a compact omni model with vision (SigLIP2), audio (Whisper), and speech synthesis (CosyVoice2) built on a Qwen3-8B backbone. ~9 GB on disk. The weights fit within a 16 GB GPU; the tables below list 1× RTX 3090 24GB as the minimum GPU.

At a Glance

Property	Value
Base model	`openbmb/MiniCPM-o-4.5`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (weight-only INT8)
FLAC status	Not measured (T+7d milestone)
Architecture	Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant format	compressed-tensors (native vLLM)
Quantized	`model.llm` transformer layers
Kept BF16	vision encoder, audio encoder, TTS components
Disk size	~9 GB
Min GPU	1× RTX 3090 24GB

Memory Requirements

Configuration	BF16	W8A16
Weights	~18 GB	~9 GB
Min GPU	1× A100 40GB	1× RTX 3090 24GB
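
The weight figures in the table follow from simple arithmetic: parameter count times bytes per weight. A minimal sketch, assuming roughly 9B parameters and treating every weight as stored at the stated precision (in this checkpoint the vision, audio, and TTS components stay BF16 and quantization scales add a little overhead, so the real W8A16 footprint is somewhat higher than the idealized number):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~9B parameters in total, all stored at the same precision.
N_PARAMS = 9e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per weight
int8_gb = weight_memory_gb(N_PARAMS, 8)   # 1 byte per weight

print(f"BF16: ~{bf16_gb:.0f} GB, W8A16: ~{int8_gb:.0f} GB")
```

This covers weights only. KV cache and activations come on top, which is why the GPU needs headroom beyond the weight size.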

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
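
Automatic detection usually relies on the `quantization_config` entry in the checkpoint's `config.json`. The sketch below shows how such an entry can be inspected before serving. The JSON fragment is illustrative only: its field values are assumptions and are not copied from this repository, and `detect_quant_method` is a hypothetical helper.

```python
import json
from typing import Optional

# Illustrative fragment; the values are assumptions, not this repo's actual config.json.
SAMPLE_CONFIG = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors"
  }
}
"""


def detect_quant_method(config_text: str) -> Optional[str]:
    """Return the declared quantization method, or None for an unquantized checkpoint."""
    config = json.loads(config_text)
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method")


method = detect_quant_method(SAMPLE_CONFIG)
print(method)
```

To check the real checkpoint, read the downloaded `config.json` and pass its text to the same function.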

vLLM — text output

# vLLM listens on port 8000 by default; --port 8080 matches the -p 8080:8080 mapping
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W8A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Requires vLLM ≥ v0.21.0. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.
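
Once the container is up, the server can be queried through its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8080` (the host port from the command above); `build_chat_request` and `chat` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: host port from the docker command
MODEL = "88plug/MiniCPM-o-4.5-W8A16"


def build_chat_request(prompt, image_url=None, max_tokens=128):
    """Build an OpenAI-style chat payload with optional image input."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat(build_chat_request("Describe this image in one sentence.", "<image URL>")))` prints the text reply.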

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert to GGUF from the BF16 base model (openbmb/MiniCPM-o-4.5), not from this W8A16 checkpoint.

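python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outtype bf16 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0

# IQ4_XS needs an importance matrix computed from the calibration text first
# (imatrix.dat is an arbitrary output file name)
llama-imatrix -m MiniCPM-o-4.5-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS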

llama-server \
  --model MiniCPM-o-4.5-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Metric	Status
Throughput (tok/s)	In progress — T+7d milestone
MMLU delta vs BF16	In progress — T+7d milestone
RULER@128k	In progress — T+30d milestone

No fabricated numbers. Results will be published to this card when measured.

What's Quantized, What's Not

Component	Precision	Reason
`model.llm.*` transformer layers	W8A16 INT8	Quantized
Vision encoder (SigLIP2)	BF16	Excluded
Audio encoder (Whisper)	BF16	Excluded
CosyVoice2 TTS	BF16	Excluded
Embeddings, LM head, norms	BF16	Standard practice
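
For readers unfamiliar with data-free RTN (round-to-nearest), here is a minimal pure-Python sketch of symmetric INT8 weight quantization. It is an illustration of the general technique, not the exact procedure used for this checkpoint: one scale per weight row is an assumption, since the card does not state the scale granularity.

```python
def rtn_quantize_row(weights, n_bits=8):
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds exact ties to even; clamp to the signed range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate weights; in W8A16 this feeds 16-bit activations."""
    return [v * scale for v in q]


row = [0.12, -0.5, 0.031, 0.27]  # made-up example weights
q, scale = rtn_quantize_row(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

No calibration data is involved: the scale comes from the weights alone, and the per-weight rounding error is bounded by half a quantization step (`scale / 2`).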

Quality Targets

Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%
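
The two targets can be computed as sketched below. This is one common way to define them, stated as an assumption: KL divergence is taken as KL(BF16 || quantized) over next-token distributions at a single position (in practice averaged over many positions), and recovery as the quantized score divided by the BF16 score. The logits and scores in the example are made up for illustration and are not measurements of this model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def recovery_pct(quant_score, baseline_score):
    """Quantized benchmark score as a percentage of the baseline score."""
    return 100.0 * quant_score / baseline_score


# Hypothetical values for illustration only.
ref = [2.0, 1.0, 0.1]
test = [1.98, 1.02, 0.1]
print(kl_divergence(ref, test))
print(recovery_pct(69.79, 70.0))
```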

vs. Other MiniCPM-o-4.5 Quants

This is the first compressed-tensors W8A16 checkpoint for MiniCPM-o-4.5. It roughly halves weight memory (~18 GB to ~9 GB) while retaining native vLLM serving with audio and vision input.

Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~9 GB	Any Ampere+ ≥16 GB	First W8A16; native vLLM; LLM backbone quantized
Community GGUF Q4_K_M	llama.cpp GGUF	~5 GB	CPU / any GPU	Vision via mmproj; no CosyVoice2 in mainline
Community GGUF Q8_0	llama.cpp GGUF	~9 GB	Any GPU ≥10 GB	Near-lossless; same TTS limitation
BF16 baseline	None	~18 GB	1× A100 40GB	Reference; requires high-VRAM GPU

Limitations

LLM backbone only: Only model.llm transformer layers are quantized. Vision encoder (SigLIP2), audio encoder (Whisper), and CosyVoice2 TTS components stay BF16.
No CosyVoice2 in mainline vLLM: Speech output is not supported by mainline vLLM. Use the tc-mb/llama.cpp-omni fork for speech synthesis.
RTN (data-free) quantization: No calibration corpus used for the LLM backbone. W8A16 RTN is expected to be near-lossless, but this has not yet been measured (see Benchmarks), and the checkpoint is not AutoRound-calibrated.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Safetensors: model size 9B params; tensor types I64, I32, BF16