Error with ollama
I tried both Q8_0 and Q4_K_M, and both failed with the following error:
500: llama runner process has terminated: exit status 2
I'm seeing the same issue in RTX environments using Docker, as well as on a Mac with the native app.
Support has been merged into llama.cpp, but the PR I submitted to Ollama has not been merged yet.
We will keep pushing to get it merged.
Thanks for answering.
There was no problem with using llama.cpp.
Same error. Here is the log, in case it helps solve the issue:
clip_model_loader: tensor[445]: n_dims = 1, name = v.blk.26.ln1.weight, tensor_size=4608, offset=938869376, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[446]: n_dims = 1, name = v.blk.26.ln1.bias, tensor_size=4608, offset=938873984, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[447]: n_dims = 2, name = v.blk.26.ffn_down.weight, tensor_size=9916416, offset=938878592, shape:[1152, 4304, 1, 1], type = f16
clip_model_loader: tensor[448]: n_dims = 1, name = v.blk.26.ffn_down.bias, tensor_size=17216, offset=948795008, shape:[4304, 1, 1, 1], type = f32
clip_model_loader: tensor[449]: n_dims = 2, name = v.blk.26.ffn_up.weight, tensor_size=9916416, offset=948812224, shape:[4304, 1152, 1, 1], type = f16
clip_model_loader: tensor[450]: n_dims = 1, name = v.blk.26.ffn_up.bias, tensor_size=4608, offset=958728640, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[451]: n_dims = 1, name = v.blk.26.ln2.weight, tensor_size=4608, offset=958733248, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[452]: n_dims = 1, name = v.blk.26.ln2.bias, tensor_size=4608, offset=958737856, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[453]: n_dims = 1, name = v.post_ln.weight, tensor_size=4608, offset=958742464, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[454]: n_dims = 1, name = v.post_ln.bias, tensor_size=4608, offset=958747072, shape:[1152, 1, 1, 1], type = f32
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: projection_dim: 0
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 5
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: ffn_op: gelu
load_hparams: model size: 914.34 MiB
load_hparams: metadata size: 0.16 MiB
load_tensors: loaded 455 tensors from E:\AI\Ollama\blobs\sha256-f0faa9ae63532300999c86a196f140c716cd0fbb08bbbd81850f1f9a631f7761
clip.cpp:3782: Unknown minicpmv version
time=2025-08-10T23:22:49.085+03:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-08-10T23:22:49.335+03:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409"
[GIN] 2025/08/10 - 23:22:49 | 500 | 4.8977797s | 127.0.0.1 | POST "/api/generate"
time=2025-08-10T23:22:54.359+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0235821 runner.size="4.1 GiB" runner.vram="3.2 GiB" runner.parallel=1 runner.pid=27496 runner.model=E:\AI\Ollama\blobs\sha256-b0ff610e9c92b30389ff1e0dd40fffed3c1f02a9d34a735fd5fba6a5ad25672b
time=2025-08-10T23:22:54.608+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2732448 runner.size="4.1 GiB" runner.vram="3.2 GiB" runner.parallel=1 runner.pid=27496 runner.model=E:\AI\Ollama\blobs\sha256-b0ff610e9c92b30389ff1e0dd40fffed3c1f02a9d34a735fd5fba6a5ad25672b
time=2025-08-10T23:22:54.859+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5234548 runner.size="4.1 GiB" runner.vram="3.2 GiB" runner.parallel=1 runner.pid=27496 runner.model=E:\AI\Ollama\blobs\sha256-b0ff610e9c92b30389ff1e0dd40fffed3c1f02a9d34a735fd5fba6a5ad25672b
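For anyone comparing their own logs against this one, the two lines that matter are `load_hparams: minicpmv_version: 5` and `clip.cpp:3782: Unknown minicpmv version`. The Python sketch below pulls both out of a saved log. The function name `diagnose` is made up for illustration, and the script only matches the strings seen in the log above; it is not part of Ollama or llama.cpp.

```python
import re

def diagnose(log_text):
    """Return (minicpmv_version, failed) for a log excerpt.

    minicpmv_version: the integer reported by load_hparams, or None if absent.
    failed: True if the loader printed "Unknown minicpmv version".
    """
    match = re.search(r"load_hparams:\s*minicpmv_version:\s*(\d+)", log_text)
    version = int(match.group(1)) if match else None
    failed = "Unknown minicpmv version" in log_text
    return version, failed

# Two lines copied from the log above.
sample = (
    "load_hparams: minicpmv_version: 5\n"
    "clip.cpp:3782: Unknown minicpmv version\n"
)

version, failed = diagnose(sample)
if failed:
    print(f"The bundled clip.cpp rejected minicpmv_version={version}")
```

If both show up in your log, you are most likely hitting the same problem described in this thread: the runner bundled with Ollama does not recognize this model's vision projector version.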
Hi, how can I run it with llama.cpp if I downloaded it through Ollama? Can llama.cpp use the Ollama blobs?
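Not an authoritative answer, but as far as I know Ollama stores model weights as ordinary GGUF files under hash-style names (the `sha256-...` files in the log above), so llama.cpp may be able to open them directly. The Python sketch below scans a blobs folder and reports which files start with the GGUF magic bytes. It assumes little-endian GGUF headers, and the helper names `gguf_version` and `find_gguf_blobs` are made up for illustration.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file

def gguf_version(path):
    """Return the GGUF version number if the file has a GGUF header, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: little-endian header, which is the common case.
    return struct.unpack("<I", header[4:8])[0]

def find_gguf_blobs(blob_dir):
    """List files in blob_dir that look like GGUF files, sorted by name."""
    return sorted(
        p for p in Path(blob_dir).iterdir()
        if p.is_file() and gguf_version(p) is not None
    )

if __name__ == "__main__":
    # Demo on throwaway files; pass your own blobs folder to find_gguf_blobs instead.
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "sha256-aaaa").write_bytes(GGUF_MAGIC + struct.pack("<I", 3) + b"\0" * 16)
        (Path(d) / "sha256-bbbb").write_bytes(b'{"not": "a model"}')
        for blob in find_gguf_blobs(d):
            print(blob.name, "GGUF v%d" % gguf_version(blob))
```

Note that the log above references two different blobs, one loaded by `clip_model_loader` and one listed as `runner.model`. So you would probably need to pass both to llama.cpp: the language model with `-m` and the vision projector with `--mmproj`. Downloading the GGUF files from this repository directly is likely simpler.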