GGUF When?
When is the GGUF quantized version releasing?
https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf
Thanks for your question. GGUF support is already available in the repository linked above, and it should contain many GGUF files. The page cache may not have refreshed yet, so please check again.
May I ask when the issue will be resolved where reasoning mode cannot be disabled for the GGUF model once llama-server has started? Thank you very much.
@zhouxihong I noticed that llama.cpp already includes instructions for using think mode.
You can disable think mode for the model by setting the think mode environment variable with "export LLAMA_ARG_THINK=0".
Hope this helps. If you have any further questions, feel free to raise an issue.
Under normal circumstances, the "disable reasoning" option provided by llama.cpp is sufficient, especially for text models. However, for minicpm-v-4.5, using --reasoning-budget 0 still does not work. As for the LLAMA_ARG_THINK environment variable, I tried setting it to 0, but the program immediately threw an error. Upon checking, this variable seems to accept only two values, none or deepseek, which control the reasoning format. It still cannot be used to disable reasoning:
E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64>llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling environment variable "LLAMA_ARG_THINK": Unknown reasoning format: 0
@zhouxihong I see. I may not have tested this on GPU.
I'll take a look at what the error is.
If it's convenient, could you send me your test data?
My email address is “caitianchi@modelbest.cn”.
My steps are as follows:
First, I start llama-server normally (the command I ran is shown above, and it is indeed running on GPU). Then I use an image for dialogue. Even with --reasoning-budget 0, the option has no effect. What’s strange is that for some images reasoning can actually be disabled, but in most cases, as soon as the image is slightly more complex, reasoning can no longer be disabled.
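The reproduction steps above can be sketched as a small client. This is a minimal sketch, assuming llama-server exposes its OpenAI-compatible /v1/chat/completions endpoint on port 8011 (the port from the command above) and accepts images as base64 data URLs. The helper names and the reasoning check are illustrative assumptions, not part of llama.cpp; in particular, the check assumes thoughts appear either in a separate "reasoning_content" field or inline inside <think> tags, depending on the reasoning format in use.

```python
import base64
import json
import re
import urllib.request

# Port 8011 matches the llama-server command shown earlier in the thread.
SERVER_URL = "http://localhost:8011/v1/chat/completions"

# Captures the text inside a <think> block, even if the block is never closed.
THINK_RE = re.compile(r"<think>(.*?)(?:</think>|$)", re.DOTALL)


def build_payload(image_bytes: bytes, question: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat request carrying one question and one image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ]
    }


def has_reasoning(message: dict) -> bool:
    """Heuristic (assumption): True if the reply carries non-empty reasoning text."""
    if (message.get("reasoning_content") or "").strip():
        return True
    match = THINK_RE.search(message.get("content") or "")
    return bool(match and match.group(1).strip())


def ask(image_path: str, question: str) -> dict:
    """Send one image question to the server and return the assistant message."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), question)
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]
```

Calling ask() on a handful of simple and complex images (the file paths are up to you) and passing each reply to has_reasoning() would show which inputs still trigger reasoning despite --reasoning-budget 0.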
I noticed that the text model of minicpm-v-4.5 is qwen3, so I used a qwen3 Jinja template and forced it to be passed during llama.cpp initialization to avoid reasoning. This method works. The command is as follows:
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf --chat-template-file E:\Downloads\qwen3_nonthinking.jinja --jinja
However, the --reasoning-budget 0 setting is ineffective, which might be a bug in llama.cpp regarding the reasoning switch for multimodal models. As a temporary workaround, the Jinja template method above can be used.
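The contents of qwen3_nonthinking.jinja are not shown in this thread. As a rough sketch of what such a template typically does, assuming it follows the Qwen3 convention of prefilling an empty, already-closed think block at the start of the assistant turn, the rendered prompt would resemble the output of the hypothetical render_prompt helper below. It handles text-only messages in ChatML style and is not the official MiniCPM-V 4.5 template.

```python
def render_prompt(messages: list[dict], thinking: bool = True) -> str:
    """Render ChatML-style text; optionally prefill an empty think block."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Open the assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    if not thinking:
        # An empty, already-closed think block nudges the model to answer directly.
        parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_prompt(
    [{"role": "user", "content": "Describe the image."}], thinking=False
)
print(prompt)
```

Because the think block is closed before generation starts, the model tends to skip straight to the answer. This is applied at the template level, which would explain why it can work even when the server-side reasoning switch does not.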
@zhouxihong I don't think the qwen3 templates can be fully reused for this model.
I used export LLAMA_ARG_THINK=0 and it worked, but I haven't tested whether it also works with llama-server. I'll test that later.
Alright, I’m really looking forward to the official fix from the minicpm team. Many thanks!
@zhouxihong Thank you for your understanding. We will submit the changes to llama.cpp as soon as possible.
