Image-Text-to-Text
Transformers
Safetensors
English
minicpm
omni
multimodal
audio
vision
tts
quantized
int4
INT4
w4a16
4-bit precision
compressed-tensors
vllm
text-generation
conversational
ptq
autoround
llmcompressor
sglang
text-generation-inference
88plug
post-training-quantization
vlm
image
Instructions to use 88plug/MiniCPM-o-4.5-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 88plug/MiniCPM-o-4.5-W4A16 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="88plug/MiniCPM-o-4.5-W4A16")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
                {"type": "text", "text": "What animal is on the candy?"}
            ]
        },
    ]
    pipe(text=messages)

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("88plug/MiniCPM-o-4.5-W4A16", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use 88plug/MiniCPM-o-4.5-W4A16 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "88plug/MiniCPM-o-4.5-W4A16"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "88plug/MiniCPM-o-4.5-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        { "type": "text", "text": "Describe this image in one sentence." },
                        { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                    ]
                }
            ]
        }'
Use Docker
docker model run hf.co/88plug/MiniCPM-o-4.5-W4A16
- SGLang
How to use 88plug/MiniCPM-o-4.5-W4A16 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "88plug/MiniCPM-o-4.5-W4A16" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "88plug/MiniCPM-o-4.5-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        { "type": "text", "text": "Describe this image in one sentence." },
                        { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                    ]
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "88plug/MiniCPM-o-4.5-W4A16" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "88plug/MiniCPM-o-4.5-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        { "type": "text", "text": "Describe this image in one sentence." },
                        { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
                    ]
                }
            ]
        }'
- Docker Model Runner
How to use 88plug/MiniCPM-o-4.5-W4A16 with Docker Model Runner:
docker model run hf.co/88plug/MiniCPM-o-4.5-W4A16
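The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes a server started as shown above is already listening locally (port 8000 in the vLLM example, port 30000 in the SGLang example). The helper names `build_chat_payload` and `post_chat` are illustrative and are not part of vLLM or SGLang.

```python
import json
import urllib.request

MODEL_ID = "88plug/MiniCPM-o-4.5-W4A16"


def build_chat_payload(prompt, image_url=None, model=MODEL_ID):
    """Build an OpenAI-style chat completion body, optionally with one image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON reply.

    base_url is an assumption: it must match the host and port the server
    was started on.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url=...))` should return a dictionary whose generated text sits under `["choices"][0]["message"]["content"]`. For the SGLang examples, pass `base_url="http://localhost:30000"`.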
Update model card
README.md
CHANGED
@@ -13,13 +13,16 @@ tags:
 - tts
 - quantized
 - int4
+- INT4
 - w4a16
 - 4-bit
 - compressed-tensors
 - vllm
 - text-generation
+- conversational
 - ptq
 - autoround
+- llmcompressor
 pipeline_tag: text-generation
 library_name: transformers
 model_type: minicpmo
@@ -63,6 +66,8 @@ Note: The non-quantized modal encoders (SigLIP2 ~1 GB, Whisper ~390 MB, CosyVoic
 
 ## Quick Start
 
+Tested with **vLLM v0.21.0** (`vllm/vllm-openai:v0.21.0-cu129-ubuntu2404`). Weights are in **compressed-tensors** format — vLLM detects and loads quantization automatically. No `--quantization` flag needed.
+
 ### vLLM — text output
 
 ```bash
@@ -238,4 +243,14 @@ llama-server \
 
 ## About
 
-
+[**88plug AI Lab**](https://huggingface.co/88plug) produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.
+
+**W8A16** — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
+
+**W4A16** — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.
+
+All weights are in compressed-tensors format. vLLM detects quantization automatically from `quantization_config` in `config.json`. No `--quantization` flag required.
+
+**Also available:** [MiniCPM-o-4.5-W8A16 (INT8, ~9 GB)](https://huggingface.co/88plug/MiniCPM-o-4.5-W8A16) · [MiniCPM-o-4.5-W4A16 (INT4, ~4–5 GB)](https://huggingface.co/88plug/MiniCPM-o-4.5-W4A16)
+
+Browse all releases → [huggingface.co/88plug](https://huggingface.co/88plug)
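The model card text added in this commit says that vLLM picks up quantization from `quantization_config` in `config.json`. As a rough illustration, the sketch below checks for that block in a locally downloaded copy of the repository. The function names and the local directory argument are hypothetical, and the `quant_method` key is an assumption about how the block is laid out; this commit does not show the contents of `config.json`.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization_config dict from config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


def describe_quantization(model_dir):
    """Summarize whether a local model directory declares quantization."""
    quant = read_quantization_config(model_dir)
    if quant is None:
        return "no quantization_config found"
    # 'quant_method' is an assumed key name; report 'unknown' when it is missing.
    method = quant.get("quant_method", "unknown")
    return f"quantization_config present (quant_method={method})"
```

Running `describe_quantization` on the downloaded model directory before serving is a quick way to confirm the block is there, which is the condition the card gives for omitting the `--quantization` flag.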