---
library_name: transformers
license: apache-2.0
base_model:
- openbmb/MiniCPM5-1B
pipeline_tag: text-generation
tags:
- minicpm5
- llm
- thinking
- axera
- ax650
language:
- en
- zh
---
# MiniCPM5-1B on AXERA NPU
Ready-to-run deployment package for `openbmb/MiniCPM5-1B` on AX650 / NPU3.
- This release packages the AX650 `axllm` runtime together with the compiled text `.axmodel` files.
- The packaged runtime is configured for text-only inference on AX650 / NPU3.
- The packaged context layout is `prefill_len=128`, `kv_cache_len=2047`, and `prefill_max_token_num=1280`.
- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
- The package includes the tokenizer, runtime config files, and the validated `bin/axllm` binary for board-side deployment.
## Supported Platform
- [x] AX650 / NPU3
## Validated Devices
This package has been validated on the following AX650-based device:
- AX650 / NPU3 development board
## Performance
All measurements below were taken on AX650 / NPU3 with the packaged `axllm` runtime. `TTFT` (time to first token) is measured end-to-end, from request arrival at `axllm serve` to the first generated token.
The validated text prompts below were kept within one `128`-token prefill chunk. To avoid one-time startup effects, each `TTFT` row excludes the first request for that prompt pattern.
| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---:|---:|---:|---:|
| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |
The packaged runtime uses the following context layout:
- `prefill_len=128`
- `kv_cache_len=2047`
- `prefill_max_token_num=1280`
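The layout above can be turned into a quick prompt-size check. The following Python sketch assumes that prompts are prefilled in chunks of `prefill_len` tokens (consistent with the `1 x 128` column in the table), that `prefill_max_token_num` caps the prompt length, and that prompt and generated tokens share the `kv_cache_len` slots. The function names are hypothetical and not part of `axllm`.

```python
import math

# Values copied from this README's context layout.
PREFILL_LEN = 128             # tokens processed per prefill chunk
PREFILL_MAX_TOKEN_NUM = 1280  # assumed upper bound on prompt tokens
KV_CACHE_LEN = 2047           # assumed total slots for prompt + reply


def prefill_chunks(input_tokens: int) -> int:
    """Number of prefill chunks needed for a prompt (ceiling division)."""
    if not 0 < input_tokens <= PREFILL_MAX_TOKEN_NUM:
        raise ValueError(
            f"prompt must have 1..{PREFILL_MAX_TOKEN_NUM} tokens, got {input_tokens}"
        )
    return math.ceil(input_tokens / PREFILL_LEN)


def decode_budget(input_tokens: int) -> int:
    """Rough upper bound on reply tokens, assuming a shared KV cache."""
    return max(KV_CACHE_LEN - input_tokens, 0)


if __name__ == "__main__":
    for n in (24, 128, 129, 1280):
        print(n, prefill_chunks(n), decode_budget(n))
```

Under these assumptions, every prompt in the table fits in one chunk, and a prompt at the `1280`-token limit would need ten.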
The `Long text generation reference` row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because the fixed EOS and response-tail overhead accounts for a larger share of the total time, which is why the other rows report decode as `n/a`.
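For capacity planning, the table's figures can be combined into a rough response-time estimate. This Python sketch assumes a single-chunk prompt (so the roughly 160 ms `TTFT` applies) and a steady decode rate equal to the long-generation reference; real latency depends on the board and the prompt. The function name is hypothetical.

```python
# Reference figures from the performance table in this README.
TTFT_S = 0.160               # ~160 ms for prompts within one prefill chunk
DECODE_TOKENS_PER_S = 17.96  # long text generation reference


def estimate_latency_s(output_tokens: int,
                       ttft_s: float = TTFT_S,
                       decode_tps: float = DECODE_TOKENS_PER_S) -> float:
    """Estimate seconds until the last token of a reply is produced.

    The first token arrives after TTFT; the remaining tokens are assumed
    to arrive at the steady decode rate.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) / decode_tps


if __name__ == "__main__":
    print(f"{estimate_latency_s(384):.1f} s for a 384-token reply")
```

Under these assumptions, a 384-token reply would take roughly 21.5 seconds.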
## Startup Runtime Footprint
| Item | Value |
|---|---:|
| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |
On the validated AX650 board, the packaged startup log confirmed `max_token_len=2047`, `prefill_len=128`, and `prefill_max_token_num=1280`. This README does not present one board's `remain_cmm(...)` value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.
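To re-check the flash figure on your own copy of the package, you can sum the model files directly. The Python sketch below assumes the flat layout of this package and counts every root-level `*.axmodel` file plus the embedding `.bin`; the function name is hypothetical.

```python
from pathlib import Path

MIB = 1024 ** 2

# Root-level files counted toward the model flash total (assumed patterns).
MODEL_PATTERNS = ("*.axmodel", "model.embed_tokens.weight.bfloat16.bin")


def flash_footprint_mib(root, patterns=MODEL_PATTERNS) -> float:
    """Sum the sizes of the model files under `root`, in MiB."""
    root = Path(root)
    total_bytes = 0
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                total_bytes += path.stat().st_size
    return total_bytes / MIB


if __name__ == "__main__":
    mib = flash_footprint_mib(".")
    print(f"{mib:.2f} MiB ({mib / 1024:.2f} GiB)")
```

Run it from the package root and compare the result against the first row of the table.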
## Package Layout
```text
.
├── README.md
├── bin/
│ ├── axllm
│ └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel
```
This package uses a flat runtime layout. The packaged `axllm` binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.
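Because the runtime reads everything from the repository root, an incomplete download is a likely cause of startup failures. The following Python sketch checks for the runtime files shown in the layout above; it assumes the 24 layer files are named `llama_p128_l0_together.axmodel` through `llama_p128_l23_together.axmodel`, and the helper names are hypothetical.

```python
from pathlib import Path

NUM_LAYERS = 24  # "24 text axmodels" per the footprint table


def expected_files() -> list:
    """Runtime files the flat package layout is expected to contain."""
    files = [
        "bin/axllm",
        "config.json",
        "post_config.json",
        "minicpm5_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "llama_post.axmodel",
    ]
    files += [f"llama_p128_l{i}_together.axmodel" for i in range(NUM_LAYERS)]
    return files


def missing_files(root) -> list:
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [name for name in expected_files() if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("package looks complete" if not missing else f"missing: {missing}")
```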
## Direct Inference with `axllm`
### Download the Model Package
Download the release package from Hugging Face:
```shell
mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .
```
### Install `axllm`
Option 1: use the validated binary included in this repository:
```bash
chmod +x ./bin/axllm
```
Option 2: install `axllm` from the public repository:
```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```
Option 3: install with a one-line command:
```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```
Option 4: download the prebuilt binary from GitHub Actions CI.
If you do not have a local build environment, download the latest CI-generated `axllm` binary from the [GitHub Actions page](https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm).
Then make it executable and install it system-wide:
```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```
### Run on the Board
This package already includes a validated `bin/axllm` binary for AX650.
From the package root on the board:
```bash
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
```
Expected model id:
```text
AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
```
Health check and model listing:
```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```
Example health output:
```json
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
```
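A client can use this response to decide whether to send a request. The Python sketch below assumes that `concurrency` is the number of in-flight requests and `max_concurrency` is the limit, which the field names suggest but this README does not state; the function names are hypothetical.

```python
import json
import urllib.request


def is_ready(health_body: str) -> bool:
    """True if the service reports healthy and has a free request slot."""
    data = json.loads(health_body)
    healthy = data.get("status") == "healthy"
    # Assumption: concurrency counts in-flight requests.
    has_slot = data.get("concurrency", 0) < data.get("max_concurrency", 1)
    return healthy and has_slot


def fetch_health(base_url: str = "http://127.0.0.1:8000") -> str:
    """Fetch the raw /health body from a running `axllm serve` instance."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(is_ready(fetch_health()))
```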
Example model list output:
```json
{
"data": [
{
"created": 1780908633,
"id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "model",
"owned_by": "openai-api"
}
],
"object": "list"
}
```
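Rather than hard-coding the model id in client code, you can read it from this response. A minimal Python sketch, assuming the response shape shown above; the function name is hypothetical.

```python
import json


def model_ids(models_body: str) -> list:
    """Extract model ids from a /v1/models response body."""
    data = json.loads(models_body)
    return [
        entry["id"]
        for entry in data.get("data", [])
        if entry.get("object") == "model" and "id" in entry
    ]


if __name__ == "__main__":
    sample = """
    {"data": [{"created": 1780908633,
               "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
               "object": "model",
               "owned_by": "openai-api"}],
     "object": "list"}
    """
    print(model_ids(sample))
```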
### Text Request
By default, this package uses no-thinking mode because the packaged `config.json` sets `enable_thinking=false`.
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": "中国的首都是哪里?请只回答城市名。"
}
],
"max_tokens": 32,
"temperature": 0
}'
```
Example output:
```json
{
"choices": [
{
"message": {
"role": "assistant",
"content": "北京"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
```
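The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above; the helper names are hypothetical, and the response parsing assumes the output shape shown in the example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"


def build_chat_request(prompt, max_tokens=32, temperature=0,
                       base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: str) -> str:
    """Return the assistant text of the first choice."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt, **kwargs) -> str:
    """Send the request to a running `axllm serve` and return the reply."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs),
                                timeout=60) as resp:
        return first_reply(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(chat("中国的首都是哪里?请只回答城市名。"))
```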
### Enable Thinking Per Request
To enable explicit reasoning output for a single request, pass top-level `enable_thinking=true`:
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": "中国的首都是哪里?请简短思考后给最终答案。"
}
],
"enable_thinking": true,
"max_tokens": 384,
"temperature": 0
}'
```
Typical output shape:
```json
{
"choices": [
{
"message": {
"role": "assistant",
"content": "<think>\n...\n</think>\n\n中国的首都是北京。"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
```
The Hugging Face-style request form is also accepted:
```json
{
"chat_template_kwargs": {
"enable_thinking": true
}
}
```
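Both request forms can be produced from one base payload. The Python sketch below adds the flag in either the top-level or the `chat_template_kwargs` form described above; the function name and the `style` parameter are hypothetical.

```python
import copy


def with_thinking(payload: dict, enabled: bool = True,
                  style: str = "top_level") -> dict:
    """Return a copy of `payload` with thinking switched on or off.

    style="top_level"            -> {"enable_thinking": ...}
    style="chat_template_kwargs" -> {"chat_template_kwargs": {"enable_thinking": ...}}
    """
    result = copy.deepcopy(payload)  # leave the caller's payload untouched
    if style == "top_level":
        result["enable_thinking"] = enabled
    elif style == "chat_template_kwargs":
        result.setdefault("chat_template_kwargs", {})["enable_thinking"] = enabled
    else:
        raise ValueError(f"unknown style: {style!r}")
    return result


if __name__ == "__main__":
    base = {"messages": [{"role": "user", "content": "hello"}], "max_tokens": 384}
    print(with_thinking(base))
    print(with_thinking(base, style="chat_template_kwargs"))
```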
When thinking mode is enabled, the service returns client-visible `<think>...</think>` markup so front ends can render the reasoning and the final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.
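A front end can separate the two parts with a small parser. This Python sketch assumes the reasoning is wrapped in a single `<think>`...`</think>` pair at the start of `content`; verify the exact tag names against your own responses. The function name is hypothetical.

```python
import re

# Assumed markup: one <think>...</think> block containing the reasoning.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> tuple:
    """Split assistant content into (reasoning, final_answer).

    Returns an empty reasoning string when no think block is present,
    for example when thinking is disabled.
    """
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>\nstep\n</think>\n\n中国的首都是北京。")
    print("reasoning:", reasoning)
    print("answer:", answer)
```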
## Browser UI with `lite_webui`
If you want a browser UI for the OpenAI-compatible service started by `axllm serve`, use [AXERA-TECH/lite_webui](https://huggingface.co/AXERA-TECH/lite_webui/tree/main).
Set the OpenAI base URL to `http://<board-ip>:8000`, where `<board-ip>` is the address of the board running `axllm serve`, and the model name to `AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047`.
## Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- AXERA conversion and deployment workflow: [AXERA-TECH/MiniCPM5-1B.axera](https://github.com/AXERA-TECH/MiniCPM5-1B.axera)
## Discussion
- GitHub Issues
- QQ group: `139953715`