MiniCPM-V-4.6-GPTQ on AXERA NPU
Ready-to-run deployment package for openbmb/MiniCPM-V-4.6-GPTQ on AX650 / NPU3.
- This release packages the AX650 axllm runtime together with the compiled text and vision .axmodel files.
- The packaged text runtime uses the GPTQ INT4 build.
- The packaged vision runtime uses a fixed-shape 448x448 MiniCPM-V-4.6 vision encoder.
- The package supports text-only chat, single-image understanding, and video understanding through the OpenAI-compatible axllm serve API.
- The package includes sample assets for image and video validation.
Supported Platform
- AX650 / NPU3
Validated Devices
This package has been validated on the following AX650-based device:
- AX650 / NPU3 development board
Performance
All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token, so the multimodal rows include media preprocessing and vision encoding time.
The text-only smoke prompt was kept within one 128-token prefill chunk. To avoid one-time startup effects, the text row below excludes the first request after service startup. Its Decode figure was measured with longer text-only generations (max_tokens=256) to better reflect sustained decode throughput; the short smoke reply used for the TTFT row is effectively a single-token answer and would otherwise under-report decode speed. The image row was measured with the packaged fixed-shape 448x448 vision encoder and assets/sample.png. The video row used the packaged sample video with video:assets/red-panda-openai.mp4:2.
| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---|---|---|---|
| Text-only smoke prompt | 25 | 1 x 128 | 260.81 ms avg (259.01-262.61 ms) | 24.07 token/s avg |
| Image prompt | 88 | 1 x 128 | 719.79 ms avg (708.57-732.47 ms) | 24.49 token/s avg |
| Video prompt | 1271 | 10 x 128 | 9555.33 ms avg (9484.00-9647.62 ms) | 23.87 token/s avg |
The packaged runtime uses the following context layout:
- prefill_len=128
- kv_cache_len=2047
- prefill_max_token_num=1280
Input tokens in the table above refers to the full request length after chat templating, not just the visual soft tokens. For the shipped 448x448 vision encoder, each selected image block contributes 64 visual soft tokens. Under the current packaged runtime settings, the sample video request in this README uses 1271 total input tokens and spans 10 prefill chunks.
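The chunk counts in the table follow from dividing the templated input length by prefill_len and rounding up. Here is a minimal Python sketch of that arithmetic. It assumes the runtime always prefills in whole 128-token chunks and treats prefill_max_token_num as a hard cap; the function names are illustrative and not part of axllm.

```python
import math

PREFILL_LEN = 128             # prefill_len from the packaged runtime
PREFILL_MAX_TOKEN_NUM = 1280  # prefill_max_token_num from the packaged runtime


def prefill_chunks(input_tokens: int, prefill_len: int = PREFILL_LEN) -> int:
    """Number of prefill chunks needed for a templated request."""
    return math.ceil(input_tokens / prefill_len)


def fits_prefill_budget(input_tokens: int, limit: int = PREFILL_MAX_TOKEN_NUM) -> bool:
    """True if the request stays within the prefill limit (assumed to be a hard cap)."""
    return input_tokens <= limit


# Input token counts taken from the performance table above.
for name, tokens in [("text", 25), ("image", 88), ("video", 1271)]:
    print(name, prefill_chunks(tokens), fits_prefill_budget(tokens))
```

With the table's values this gives 1, 1, and 10 chunks. The video request at 1271 tokens sits just under the 1280-token limit.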
Startup Runtime Footprint
| Item | Value |
|---|---|
| Flash total (text + post + vision axmodels) | 1.19 GiB (1214.38 MiB) |
| Package flash total (current repository layout, excluding runtime-generated vision_cache/) | 1.68 GiB (1719.30 MiB) |
| Runtime CMM increment during board-side startup | 1.30 GiB (1334.05 MiB) |
The runtime CMM value above was measured during board-side startup on a shared AX650 system and should be treated as a practical reference value.
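The GiB figures in the table are the MiB values divided by 1024 and rounded to two decimals. The short sketch below reproduces that conversion and also derives how much of the package is not axmodel data. The dictionary keys are illustrative labels, not names used by the runtime.

```python
MIB_PER_GIB = 1024

# MiB values copied from the footprint table above.
footprint_mib = {
    "axmodels": 1214.38,     # text + post + vision axmodels
    "package": 1719.30,      # repository layout, excluding vision_cache/
    "runtime_cmm": 1334.05,  # CMM increment during board-side startup
}


def mib_to_gib(mib: float) -> float:
    """Convert MiB to GiB, rounded to two decimals as in the table."""
    return round(mib / MIB_PER_GIB, 2)


footprint_gib = {name: mib_to_gib(value) for name, value in footprint_mib.items()}

# Everything in the package that is not an axmodel file.
non_axmodel_mib = round(footprint_mib["package"] - footprint_mib["axmodels"], 2)

print(footprint_gib)
print(non_axmodel_mib)
```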
Vision Encoder Latency
Measured on AX650 / NPU3 with /opt/bin/ax_run_model -m minicpmv4_6_vision_448.axmodel -g 0 -w 1 -r 5.
| Model | Resolution | Soft Tokens | Time (ms) |
|---|---|---|---|
| minicpmv4_6_vision_448.axmodel | 448x448 | 64 | 235.285 ms avg |
For this packaged AX650 runtime, the visual token count is fixed by the shipped vision encoder configuration:
- vision_width = 448
- vision_height = 448
- vision_patch_size = 14
- patch grid = (448 / 14) x (448 / 14) = 32 x 32
- raw patch tokens = 32 x 32 = 1024
- current packaged build uses the 16x visual compression path
- Soft Tokens = 1024 / 16 = 64
So, for the fixed-shape runtime shipped in this repository, the relation is:
Soft Tokens = (vision_width / patch_size) x (vision_height / patch_size) / 16
Input tokens in the performance table can be larger than the visual Soft Tokens because axllm counts the full templated request, including user text and chat-template tokens in addition to the visual tokens. For the packaged assets/sample.png request in this README, the runtime reports input_num_token=88, which still fits within a single 128-token prefill chunk.
Soft Tokens is not a runtime-configurable value in this package. This repository ships only minicpmv4_6_vision_448.axmodel, so the board-side AX650 runtime always uses 448x448 -> 64 soft tokens for image encoding.
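The soft-token relation above can be checked with a few lines of Python. This is only a sketch of the arithmetic: soft_tokens is an illustrative helper, not an axllm function, and the last line assumes the sample image request contains exactly one 448x448 image block.

```python
def soft_tokens(width: int, height: int, patch_size: int = 14, compression: int = 16) -> int:
    """Visual soft tokens for one image block at a fixed encoder resolution."""
    if width % patch_size or height % patch_size:
        raise ValueError("resolution must be a multiple of the patch size")
    raw_patches = (width // patch_size) * (height // patch_size)
    if raw_patches % compression:
        raise ValueError("patch count must be divisible by the compression factor")
    return raw_patches // compression


visual = soft_tokens(448, 448)
print(visual)

# Assumption: the sample.png request (input_num_token=88) holds a single image block,
# so the remainder is user text plus chat-template tokens.
print(88 - visual)
```

Under that assumption, the 88-token image request breaks down into 64 visual tokens and 24 text and template tokens.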
Package Layout
.
├── README.md
├── bin/
│ ├── axllm
│ └── axllm.version.json
├── assets/
│ ├── openai_api_demo.png
│ ├── red-panda-openai.mp4
│ └── sample.png
├── minicpmv4_6_vision_448.axmodel
├── qwen3_5_text_p128_l0_together.axmodel
├── ...
├── qwen3_5_text_p128_l23_together.axmodel
├── qwen3_5_text_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── config.json
├── post_config.json
└── minicpm_v46_tokenizer.txt
This package keeps the runtime files at the repository root so it can be served directly by axllm.
Sample Image
Both the axllm flow and the packaged sample requests can use the sample image:
assets/sample.png
Sample Video
The package also includes a sample video for board-side video understanding validation:
assets/red-panda-openai.mp4
Direct Inference with axllm
The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.
Download the Model Package
Download the release package from Hugging Face:
mkdir -p AXERA-TECH/MiniCPM-V-4.6-GPTQ
cd AXERA-TECH/MiniCPM-V-4.6-GPTQ
hf download AXERA-TECH/MiniCPM-V-4.6-GPTQ --local-dir .
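If the hf command is not available, the same download can be scripted from Python. The sketch below assumes the huggingface_hub package is installed (pip install huggingface_hub); download_package is an illustrative helper name, not something shipped in this repository.

```python
from pathlib import Path

REPO_ID = "AXERA-TECH/MiniCPM-V-4.6-GPTQ"


def download_package(local_dir: str = REPO_ID) -> Path:
    """Download the full model repository into local_dir and return its path."""
    # Imported lazily so this file can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return Path(snapshot_download(repo_id=REPO_ID, local_dir=str(target)))
```

Calling download_package() from the working directory mirrors the mkdir and hf download steps above. Change into the returned directory before running the axllm commands that follow.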
Install axllm
Option 1: use the validated binary included in this repository:
chmod +x ./bin/axllm
Option 2: install axllm from the public repository:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 3: install with a one-line command:
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 4: download the prebuilt binary from GitHub Actions CI:
If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Then run:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Run on the Board
The package root is already arranged for axllm, so no extra runtime path arguments are required.
For multimodal testing, you can use the packaged sample image shown above: ./assets/sample.png, or the packaged sample video: ./assets/red-panda-openai.mp4.
./bin/axllm run .
In interactive mode:
- press Enter directly for text-only chat
- input an image path for single-image chat
- input video:/path/to/frames_dir or video:/path/to/video.mp4 for video chat
Serve with axllm
From the package root on the board:
./bin/axllm serve . --port 8000
Expected model id:
AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047
Health check:
curl http://127.0.0.1:8000/health
A typical startup log looks like this:
INF Init | LLM init start
INF Init | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
INF Init | attention config: layers=24 sliding=0 full=6 linear=18 sliding_window=0 ref_full_layer_idx=3
tokenizer_type = 3
huggingface tokenizer mode = gpt2_byte_bpe
...
INF Init | max_token_len : 2047
INF Init | kv_cache_size : 512, kv_cache_num: 2047
INF init_groups_from_model | prefill_token_num : 128
INF init_groups_from_model | prefill_max_token_num : 1280
INF Init | MiniCPM-V-4.6 token ids: image_pad=248056 video_pad=248057
INF Init | VisionModule init ok: type=MiniCPMV46VL, tokens_per_block=64, embed_size=1024, out_dtype=fp32
INF Init | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047'...
API URLs:
GET http://127.0.0.1:8000/health
GET http://127.0.0.1:8000/v1/models
POST http://127.0.0.1:8000/v1/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047
You can then send requests to the server using the API endpoints shown in the log. For example, to check the health status and list the available models:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Example output:
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
{
"data": [
{
"created": 1780911663,
"id": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"object": "model",
"owned_by": "openai-api"
}
],
"object": "list"
}
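When scripting against the server, it can help to wait for /health before sending the first request and to read the model id from /v1/models instead of hard-coding it. The Python sketch below uses only the standard library. The helper names are illustrative, and reading "concurrency" as the number of in-flight requests is an assumption based on the example output.

```python
import json
import time
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # matches `axllm serve . --port 8000`


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET a JSON document from the server."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def is_idle(health: dict) -> bool:
    """True if the server reports healthy and (assumed) has a free request slot."""
    return (
        health.get("status") == "healthy"
        and health.get("concurrency", 0) < health.get("max_concurrency", 1)
    )


def model_ids(models: dict) -> list:
    """Extract the model ids from a /v1/models response."""
    return [entry["id"] for entry in models.get("data", [])]


def wait_until_ready(retries: int = 30, delay: float = 1.0) -> str:
    """Poll /health, then return the first model id from /v1/models."""
    for _ in range(retries):
        try:
            if is_idle(fetch_json("/health")):
                return model_ids(fetch_json("/v1/models"))[0]
        except OSError:
            pass  # connection refused or timed out: server not up yet
        time.sleep(delay)
    raise TimeoutError("axllm serve did not become ready")
```

Calling wait_until_ready() after starting the server should return the expected model id once startup has finished.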
Text Request
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is 1+1? Reply with the number only."}
]
}
],
"max_tokens": 32
}'
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "2"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
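The same text request can be sent from Python with the standard library. In this sketch, build_text_payload, extract_reply, and chat are illustrative helper names, and the response shape is assumed to match the example output above.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047"


def build_text_payload(prompt: str, max_tokens: int = 32) -> dict:
    """Build a text-only chat completion payload in the format shown above."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = "http://127.0.0.1:8000", timeout: float = 60.0) -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    req = Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_text_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.loads(resp.read().decode()))
```

With the server running, chat("What is 1+1? Reply with the number only.") sends the same request as the curl command above.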
Image Request
python3 - <<'PY'
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

img = Path("assets/sample.png").read_bytes()
payload = {
    "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64," + base64.b64encode(img).decode()
                    },
                },
            ],
        }
    ],
    "max_tokens": 64,
}
req = Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req, timeout=60) as resp:
    print(resp.read().decode())
PY
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "The image shows a colorful, cartoon-style red lobster or lobster-like character with a cheerful expression, raised claws, and a dynamic, action-oriented pose."
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
Video Request
axllm serve accepts either a frames directory or a raw video file:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "video:/path/to/frames_dir"}},
{"type": "text", "text": "Describe this video briefly."}
]
}
],
"max_tokens": 128
}'
For a raw video file, use video:/path/to/video.mp4. If you need to request a specific sampling FPS, use the form video:/path/to/video.mp4:2.
To test the packaged sample video from the package root, you can set:
VIDEO_PATH="$(pwd)/assets/red-panda-openai.mp4"
and then use video:${VIDEO_PATH}:2 in the request payload.
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "The red panda is seen playing with the other red panda."
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
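The video: reference format described above (a path with an optional sampling FPS suffix) is easy to get wrong by hand, so a small helper can build it. In this Python sketch, video_url and build_video_payload are illustrative names. The path is presumably resolved on the machine running axllm serve, which is an assumption rather than documented behavior.

```python
from typing import Optional

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047"


def video_url(path: str, fps: Optional[int] = None) -> str:
    """Build a video: reference for a frames directory or video file."""
    url = f"video:{path}"
    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        url += f":{fps}"  # optional sampling FPS suffix, e.g. video:/path/to/video.mp4:2
    return url


def build_video_payload(path: str, prompt: str, fps: Optional[int] = None,
                        max_tokens: int = 128) -> dict:
    """Build a chat completion payload with the video reference first, then the text."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": video_url(path, fps)}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


print(video_url("/path/to/video.mp4", 2))
```

The resulting dictionary can be serialized with json.dumps and posted to /v1/chat/completions in the same way as the image request.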
Browser UI with lite_webui
If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.
Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047.
Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: openbmb/MiniCPM-V-4.6-GPTQ
- AXERA conversion and deployment workflow: AXERA-TECH/MiniCPM-V-4.6.axera
Discussion
- GitHub Issues
- QQ group: 139953715

