Instructions to use openbmb/MiniCPM-V-4_5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use openbmb/MiniCPM-V-4_5 with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="openbmb/MiniCPM-V-4_5", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True, dtype="auto")
```

vLLM

How to use openbmb/MiniCPM-V-4_5 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "openbmb/MiniCPM-V-4_5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM-V-4_5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```
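The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are made up for illustration, and the base URL assumes the default vLLM port shown in the curl call above.

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url):
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(base_url, payload, timeout=120):
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request(
    "openbmb/MiniCPM-V-4_5",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Uncomment once a server is running (port 8000 for vLLM, 30000 for SGLang):
# reply = send_chat_request("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Because both servers expose the same OpenAI-compatible route, switching between vLLM and SGLang should only require changing the base URL.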

Use Docker

```
docker model run hf.co/openbmb/MiniCPM-V-4_5
```

SGLang

How to use openbmb/MiniCPM-V-4_5 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "openbmb/MiniCPM-V-4_5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM-V-4_5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "openbmb/MiniCPM-V-4_5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM-V-4_5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Docker Model Runner
How to use openbmb/MiniCPM-V-4_5 with Docker Model Runner:
```
docker model run hf.co/openbmb/MiniCPM-V-4_5
```

GGUF When?

by xTimeCrystal - opened Aug 26, 2025

Discussion

xTimeCrystal

Aug 26, 2025

When is the GGUF quantized version releasing?

l33tkr3w

Aug 26, 2025

There already is one: https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf

tc-mb

OpenBMB org Aug 26, 2025

https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf
We've received your question. GGUF support is already provided in this repository, and it should contain many GGUF files. Your cache may not have been refreshed, so please check again.

zhouxihong

Sep 1, 2025

May I ask when the issue will be resolved where reasoning mode cannot be disabled once llama-server has started with the GGUF model? Thank you very much.

tc-mb

OpenBMB org Sep 1, 2025

@zhouxihong Ok, I will submit a PR to resolve it before Wednesday.

tc-mb

OpenBMB org Sep 2, 2025

@zhouxihong I noticed that llama.cpp already includes instructions for using think mode.
You can disable think mode for the model by setting the think mode environment variable with "export LLAMA_ARG_THINK=0".
Hope this helps. If you have any further questions, feel free to raise an issue.

zhouxihong

Sep 2, 2025

Under normal circumstances, the “disable reasoning” mode provided by llama.cpp is already sufficient, especially for text models. However, for minicpm-v-4.5, using --reasoning-budget 0 still does not work. As for the LLAMA_ARG_THINK environment variable, I tried setting it to 0, but the program then threw an error immediately on startup. Upon checking, this variable seems to accept only two values, none or deepseek, which control the reasoning format. It still cannot be used to disable reasoning:

```
E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64>llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling environment variable "LLAMA_ARG_THINK": Unknown reasoning format: 0
```

tc-mb

OpenBMB org Sep 2, 2025

@zhouxihong I see. I may not have tested this on GPU.
I'll take a look at what the error is.
If it's convenient, could you send me your test data?
My email address is “caitianchi@modelbest.cn”.

zhouxihong

Sep 2, 2025

My steps are as follows:
First, I start llama-server normally (the command I ran is shown above, and it is indeed running on GPU). Then I start a dialogue with an image. Even with --reasoning-budget 0, the option has no effect. Strangely, reasoning can actually be disabled for some images, but in most cases, as soon as the image is slightly more complex, reasoning can no longer be disabled.
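To make these steps repeatable across several images, a small client sketch may help. It assumes that the llama-server build accepts OpenAI-style `image_url` parts carrying base64 data URLs on `/v1/chat/completions`, and that reasoning shows up either in a `reasoning_content` field or as a `<think>` tag in the reply text. Both are assumptions about the server's behavior, and the helper names are invented for this example.

```python
import base64


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL for an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_payload(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a chat payload with one local image and one text prompt."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": to_data_url(image_bytes, mime_type)},
                    },
                ],
            }
        ]
    }


def reply_contains_reasoning(reply):
    """Return True if a chat completion reply appears to include reasoning output.

    Assumption: reasoning is reported in message["reasoning_content"] or
    left inline in message["content"] inside <think> tags.
    """
    message = reply["choices"][0]["message"]
    content = message.get("content") or ""
    return bool(message.get("reasoning_content")) or "<think>" in content
```

POSTing `build_payload(...)` for each test image to `http://localhost:8011/v1/chat/completions` (the port from the command above) and passing the parsed JSON to `reply_contains_reasoning` would show which images still trigger reasoning.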

zhouxihong

Sep 2, 2025

I noticed that the text model of minicpm-v-4.5 is qwen3, so I took a qwen3 Jinja template with thinking disabled and forced llama.cpp to load it at startup. This method works. The command is as follows:

```
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf --chat-template-file E:\Downloads\qwen3_nonthinking.jinja --jinja
```

However, the --reasoning-budget 0 setting is ineffective, which might be a bug in llama.cpp regarding the reasoning switch for multimodal models. As a temporary workaround, the Jinja template method above can be used.
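The contents of qwen3_nonthinking.jinja are not shown in this thread. As a rough illustration of what such a template typically does, the sketch below renders a text-only ChatML prompt in Python and, when thinking is disabled, pre-fills an empty `<think>` block in the assistant turn, which is how Qwen3's published chat template handles `enable_thinking=False`. The function name `render_chatml` is hypothetical, and image placeholders and tool calls are left out.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=False):
    """Render messages in ChatML form, optionally pre-filling an empty think block.

    Simplified, text-only stand-in for a Jinja chat template; it is not the
    actual template file used in the command above.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # An already-closed, empty think block leaves the model nothing to reason in.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Describe this image in one sentence."}])
print(prompt)
```

Since the prompt already ends with a closed think block, the model tends to continue straight to the answer. This only covers the text side of the template, so it may not carry over to minicpm-v-4.5 unchanged.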

tc-mb

OpenBMB org Sep 2, 2025

@zhouxihong I don't think it can fully reuse the qwen3 templates.
I used export LLAMA_ARG_THINK=0 and it worked, but I haven't tested whether it also works with llama-server. I'll test that later.

zhouxihong

Sep 2, 2025

Alright, I’m really looking forward to the official fix from the minicpm team. Many thanks!

tc-mb

OpenBMB org Sep 2, 2025

@zhouxihong Thank you for your understanding. We will submit the changes to llama.cpp as soon as possible.
