MiniCPM5-1B on AXERA NPU
Ready-to-run deployment package for openbmb/MiniCPM5-1B on AX650 / NPU3.
- This release packages the AX650 axllm runtime together with the compiled text .axmodel files.
- The packaged runtime is configured for text-only inference on AX650 / NPU3.
- The packaged context layout is prefill_len=128, kv_cache_len=2047, and prefill_max_token_num=1280.
- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
- The package includes the tokenizer, runtime config files, and the validated bin/axllm binary for board-side deployment.
Supported Platform
- AX650 / NPU3
Validated Devices
This package has been validated on the following AX650-based device:
- AX650 / NPU3 development board
Performance
All measurements below were taken on AX650 / NPU3 with the packaged axllm runtime. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token.
The validated text prompts below were kept within one 128-token prefill chunk. To avoid one-time startup effects, each TTFT row excludes the first request for that prompt pattern.
| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---|---|---|---|
| Text smoke prompt | 24 | 1 x 128 | 160.34 ms avg (159.40-161.28 ms) | n/a (single-token reply) |
| Short front-end prompt | 14 | 1 x 128 | 157.76 ms avg (157.68-157.84 ms) | n/a (short reply) |
| Multi-turn text prompt | 40 | 1 x 128 | 159.89 ms avg (159.19-160.59 ms) | n/a (short reply) |
| Long text generation reference | 30 | 1 x 128 | 159.91 ms avg (159.34-160.49 ms) | 17.96 token/s avg |
The packaged runtime uses the following context layout:
- prefill_len=128
- kv_cache_len=2047
- prefill_max_token_num=1280
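As a rough illustration of how this layout bounds a request, the sketch below computes how many 128-token prefill chunks a prompt would occupy. It assumes chunks are filled by simple ceiling division and that prompts longer than prefill_max_token_num are out of range. Both are assumptions made for illustration, and prefill_chunks is a hypothetical helper, not part of axllm.

```python
import math

# Values taken from the packaged context layout.
PREFILL_LEN = 128
KV_CACHE_LEN = 2047
PREFILL_MAX_TOKEN_NUM = 1280


def prefill_chunks(input_tokens: int) -> int:
    """Number of prefill chunks a prompt occupies (hypothetical helper).

    Assumes ceiling division over PREFILL_LEN and treats prompts above
    PREFILL_MAX_TOKEN_NUM as out of range; the runtime may behave differently.
    """
    if input_tokens < 1:
        raise ValueError("input_tokens must be positive")
    if input_tokens > PREFILL_MAX_TOKEN_NUM:
        raise ValueError("prompt exceeds prefill_max_token_num")
    return math.ceil(input_tokens / PREFILL_LEN)


if __name__ == "__main__":
    for n in (24, 128, 129, 1280):
        print(n, "tokens ->", prefill_chunks(n), "chunk(s)")
```

Under these assumptions, every validated prompt in the table (14 to 40 input tokens) fits in a single chunk, which matches the "1 x 128" column.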
The "Long text generation reference" row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS handling and response-tail overhead account for a larger share of the measured time.
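For capacity planning, the TTFT and decode figures can be combined into a back-of-the-envelope response time. The sketch below assumes a single-chunk prompt and a constant decode rate, which real workloads will only approximate; estimate_response_seconds is a hypothetical helper.

```python
# Figures from the "Long text generation reference" row of the table.
TTFT_S = 0.15991            # 159.91 ms average time to first token
DECODE_TOKENS_PER_S = 17.96  # average decode rate


def estimate_response_seconds(output_tokens: int) -> float:
    """Rough wall-clock estimate: TTFT plus steady-state decode time.

    Assumes a single-chunk prompt and a constant decode rate, so the
    result is only an approximation.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be positive")
    # The first token arrives at TTFT; the remaining tokens follow at the decode rate.
    return TTFT_S + (output_tokens - 1) / DECODE_TOKENS_PER_S
```

With these numbers, a 384-token reply comes out at roughly 21.5 seconds, almost all of it decode time.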
Startup Runtime Footprint
| Item | Value |
|---|---|
| Flash total (24 text axmodels + post axmodel + embedding bin) | 1.42 GiB (1456.71 MiB) |
| Package flash total (excluding .git/) | 1.43 GiB (1464.24 MiB) |
| Runtime CMM requirement | Board-dependent; validate on your target AX650 CMM pool |
On the validated AX650 board, the packaged startup log confirmed max_token_len=2047, prefill_len=128, and prefill_max_token_num=1280. This README does not present one board's remain_cmm(...) value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.
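To check the flash figure against a downloaded copy, one option is to sum the model file sizes directly. The sketch below counts every .axmodel file in the package root plus the embedding .bin file named in the package layout; flash_footprint_mib is a hypothetical helper, and the result depends on which files are actually present locally.

```python
from pathlib import Path

MIB = 1024 * 1024


def flash_footprint_mib(package_root: str) -> float:
    """Sum the sizes of the model files in a flat package directory, in MiB.

    Counts every *.axmodel file plus the embedding .bin file; all other
    files (README, configs, tokenizer, bin/) are ignored.
    """
    root = Path(package_root)
    files = list(root.glob("*.axmodel"))
    files += list(root.glob("model.embed_tokens.weight.bfloat16.bin"))
    return sum(f.stat().st_size for f in files) / MIB
```

Dividing the MiB result by 1024 gives GiB, which is how 1456.71 MiB corresponds to about 1.42 GiB.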
Package Layout
.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel
This package uses a flat runtime layout. The packaged axllm binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.
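Because the runtime reads these files straight from the package root, an incomplete download is worth catching before starting the service. The following sketch lists files that appear to be missing. It assumes the elided layer files follow the same llama_p128_l{N}_together.axmodel naming for layers 0 through 23 (consistent with the "24 text axmodels" count, but not spelled out in the layout), and both function names are hypothetical.

```python
from pathlib import Path

NUM_LAYERS = 24  # assumed: llama_p128_l0 ... llama_p128_l23


def expected_runtime_files() -> list[str]:
    """File names the flat layout is expected to contain (partly assumed)."""
    names = [f"llama_p128_l{i}_together.axmodel" for i in range(NUM_LAYERS)]
    names += [
        "llama_post.axmodel",
        "model.embed_tokens.weight.bfloat16.bin",
        "minicpm5_tokenizer.txt",
        "config.json",
        "post_config.json",
        "bin/axllm",
    ]
    return names


def missing_runtime_files(package_root: str) -> list[str]:
    """Return the expected files that do not exist under package_root."""
    root = Path(package_root)
    return [name for name in expected_runtime_files() if not (root / name).exists()]
```

Calling missing_runtime_files(".") from the package root returns an empty list when everything on the assumed list is present.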
Direct Inference with axllm
Download the Model Package
Download the release package from Hugging Face:
mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .
Install axllm
Option 1: use the validated binary included in this repository:
chmod +x ./bin/axllm
Option 2: install axllm from the public repository:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 3: install with a one-line command:
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 4: download the prebuilt binary from GitHub Actions CI:
If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Then run:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Run on the Board
This package already includes a validated bin/axllm binary for AX650.
From the package root on the board:
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
Expected model id:
AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
Health check and model listing:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Example health output:
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
Example model list output:
{
"data": [
{
"created": 1780908633,
"id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "model",
"owned_by": "openai-api"
}
],
"object": "list"
}
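The same two checks can be scripted, for example to wait for the service before sending requests. This is a minimal sketch using only the Python standard library. Reading "concurrency < max_concurrency" as "a request slot is free" is an interpretation of the field names in the example output, not documented behavior, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same address as the curl examples


def get_json(path: str, timeout: float = 5.0) -> dict:
    """GET a JSON document from the running axllm service."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return json.load(resp)


def is_ready(health: dict) -> bool:
    """True when the service reports healthy and (by assumption) has a free slot."""
    return (
        health.get("status") == "healthy"
        and health.get("concurrency", 0) < health.get("max_concurrency", 1)
    )


def model_ids(model_list: dict) -> list[str]:
    """Extract the model ids from a /v1/models response."""
    return [entry["id"] for entry in model_list.get("data", [])]
```

On the board, is_ready(get_json("/health")) and model_ids(get_json("/v1/models")) mirror the two curl commands.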
Text Request
By default, this package uses no-thinking mode because the packaged config.json sets enable_thinking=false.
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": "中国的首都是哪里？请只回答城市名。"
}
],
"max_tokens": 32,
"temperature": 0
}'
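Example output (the Chinese prompt asks "What is the capital of China? Answer with the city name only."; the reply is "Beijing"):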
{
"choices": [
{
"message": {
"role": "assistant",
"content": "北京"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
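The curl request above translates directly to a small Python client. The sketch below uses only the standard library and assumes the service is reachable at the same address; build_chat_request and chat are hypothetical helper names, and only the response fields shown in the example output are read.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"


def build_chat_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 32, timeout: float = 60.0) -> str:
    """Send the request and return the assistant message content."""
    req = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent to the board.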
Enable Thinking Per Request
To enable explicit reasoning output for a single request, pass top-level enable_thinking=true:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": "中国的首都是哪里？请简短思考后给最终答案。"
}
],
"enable_thinking": true,
"max_tokens": 384,
"temperature": 0
}'
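Typical output shape (the Chinese prompt asks for the capital of China after brief thinking; the final answer reads "The capital of China is Beijing."):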
{
"choices": [
{
"message": {
"role": "assistant",
"content": "<think>\n...\n</think>\n\n中国的首都是北京。"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
The Hugging Face-style request form is also accepted:
{
"chat_template_kwargs": {
"enable_thinking": true
}
}
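Both request forms differ only in where the flag sits in the JSON body. As a small sketch, the hypothetical helper below adds either form to an existing chat payload without modifying the original dictionary.

```python
import copy


def with_thinking(payload: dict, hf_style: bool = False) -> dict:
    """Return a copy of a chat payload with thinking enabled.

    hf_style=False sets the top-level "enable_thinking" field;
    hf_style=True nests it under "chat_template_kwargs" instead.
    """
    out = copy.deepcopy(payload)
    if hf_style:
        out.setdefault("chat_template_kwargs", {})["enable_thinking"] = True
    else:
        out["enable_thinking"] = True
    return out
```

For example, with_thinking({"model": "...", "messages": [...]}) yields the top-level form, and passing hf_style=True yields the Hugging Face-style form.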
When thinking mode is enabled, the service returns client-visible <think>...</think> markup so front ends can render reasoning and final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.
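A front end can separate the two parts with a simple pattern match. The sketch below assumes at most one <think>...</think> block per reply, as in the output shape shown earlier. The strip_reasoning helper is optional client-side trimming that keeps stored history small; the service template already omits earlier reasoning on its own. Both names are hypothetical.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Split assistant content into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present.
    """
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer


def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Copy a message history with <think> blocks removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": split_thinking(msg["content"])[1]}
        cleaned.append(msg)
    return cleaned
```

A UI could then render the first element of split_thinking(...) in a collapsible panel and the second as the visible answer.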
Browser UI with lite_webui
If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.
Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047.
Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: openbmb/MiniCPM5-1B
- AXERA conversion and deployment workflow: AXERA-TECH/MiniCPM5-1B.axera
Discussion
- GitHub Issues
- QQ group: 139953715