MiniCPM5-1B on AXERA NPU

Ready-to-run deployment package for openbmb/MiniCPM5-1B on AX650 / NPU3.

This release packages the AX650 axllm runtime together with the compiled text .axmodel files.
The packaged runtime is configured for text-only inference on AX650 / NPU3.
The packaged context layout is prefill_len=128, kv_cache_len=2047, and prefill_max_token_num=1280.
Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
The package includes the tokenizer, runtime config files, and the validated bin/axllm binary for board-side deployment.

Supported Platform

AX650 / NPU3

Validated Devices

This package has been validated on the following AX650-based device:

AX650 / NPU3 development board

Performance

All measurements below were taken on AX650 / NPU3 with the packaged axllm runtime. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token.

The validated text prompts below were kept within one 128-token prefill chunk. To avoid one-time startup effects, each TTFT row excludes the first request for that prompt pattern.

| Scenario | Input tokens | Prefill chunks | TTFT (avg, range) | Decode speed |
| --- | --- | --- | --- | --- |
| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |

The packaged runtime uses the following context layout:

prefill_len=128
kv_cache_len=2047
prefill_max_token_num=1280
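
To make the layout concrete, the sketch below computes how many prefill chunks a prompt would occupy. It assumes the runtime splits a prompt into ceil(n / prefill_len) chunks and treats prefill_max_token_num as the upper bound on prompt length; both readings are inferred from the parameter names and the table above, not from axllm documentation. The function names are illustrative.

```shell
# Context layout values copied from this README.
PREFILL_LEN=128
PREFILL_MAX_TOKEN_NUM=1280

# Print how many prefill chunks a prompt of $1 tokens would need,
# assuming the prompt is split into ceil(n / PREFILL_LEN) chunks.
prefill_chunks() {
  echo $(( ($1 + PREFILL_LEN - 1) / PREFILL_LEN ))
}

# Succeed only if a prompt of $1 tokens fits within PREFILL_MAX_TOKEN_NUM
# (assumed here to be the prompt-length limit).
fits_prefill() {
  [ "$1" -le "$PREFILL_MAX_TOKEN_NUM" ]
}

prefill_chunks 40      # the multi-turn prompt from the table: prints 1
prefill_chunks 1280    # a prompt that fills the whole prefill budget: prints 10
```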

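The Long text generation reference row is the recommended sustained text-only decode figure for this package. The other rows report no decode figure because their replies are too short: with only a few generated tokens, EOS handling and response-tail overhead dominate the measured time, so the result under-reports decode speed.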
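
For rough capacity planning, the two reference figures can be combined into a reply-time estimate. This is a back-of-the-envelope sketch: it assumes decode speed stays near the sustained average for the whole reply and ignores network and client overhead. The function name is illustrative.

```shell
# Figures taken from the "Long text generation reference" row.
TTFT_MS=159.91
DECODE_TOK_PER_S=17.96

# Print a rough end-to-end time in seconds for a reply of $1 generated tokens:
# TTFT plus decode time at the sustained rate.
estimate_reply_seconds() {
  awk -v n="$1" -v ttft="$TTFT_MS" -v rate="$DECODE_TOK_PER_S" \
    'BEGIN { printf "%.2f\n", ttft / 1000 + n / rate }'
}

estimate_reply_seconds 256    # prints 14.41
```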

Startup Runtime Footprint

| Item | Value |
| --- | --- |
| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |

On the validated AX650 board, the packaged startup log confirmed max_token_len=2047, prefill_len=128, and prefill_max_token_num=1280. This README does not present one board's remain_cmm(...) value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.
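
To cross-check the flash figure on a downloaded copy, a sketch like the following sums the model artifacts in the package root. It assumes the flash total covers exactly the root-level *.axmodel and *.bin files, as the table row suggests; the function name is illustrative.

```shell
# Print the combined size in MiB of the *.axmodel and *.bin files
# directly inside directory $1 (default: current directory).
model_flash_mib() (
  dir=${1:-.}
  total=0
  for f in "$dir"/*.axmodel "$dir"/*.bin; do
    [ -f "$f" ] || continue            # skip glob patterns that matched nothing
    size=$(wc -c < "$f")               # file size in bytes
    total=$(( total + size ))
  done
  awk -v b="$total" 'BEGIN { printf "%.2f\n", b / 1048576 }'
)
```

Running `model_flash_mib .` from the package root should print a value close to the 1456.71 MiB listed above if the download is complete.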

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel

This package uses a flat runtime layout. The packaged axllm binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.
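
Because the runtime reads these files straight from the root, an incomplete download tends to show up only at startup. The sketch below checks the layout beforehand. It assumes the elided layer files follow the same naming pattern for layers 0 through 23, which matches the 24 text axmodels mentioned earlier but is not spelled out in the listing; the function name is illustrative.

```shell
# Print every expected runtime file missing from package root $1
# (default: current directory); return non-zero if any are missing.
check_package() (
  dir=${1:-.}
  missing=0
  set -- bin/axllm config.json post_config.json minicpm5_tokenizer.txt \
         model.embed_tokens.weight.bfloat16.bin llama_post.axmodel
  # Assumption: layer files are numbered 0..23 with the same name pattern.
  i=0
  while [ "$i" -le 23 ]; do
    set -- "$@" "llama_p128_l${i}_together.axmodel"
    i=$(( i + 1 ))
  done
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

A typical use is `check_package . && ./bin/axllm serve . --port 8000`, so the server only starts when every file is present.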

Direct Inference with `axllm`

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .

Install `axllm`

Option 1: use the validated binary included in this repository:

chmod +x ./bin/axllm

Option 2: install axllm from the public repository:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 3: install with a one-line command:

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 4: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions (https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm), then install it:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Run on the Board

This package already includes a validated bin/axllm binary for AX650.

From the package root on the board:

chmod +x ./bin/axllm
./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047

Health check and model listing:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example health output:

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
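
In startup scripts it can help to wait until the health endpoint reports healthy before sending requests. The sketch below polls /health with curl and matches the status field shown above. The function names, the one-second interval, and the default of 30 attempts are illustrative choices, not axllm behavior.

```shell
# Succeed if the JSON on stdin reports "status": "healthy".
is_healthy() {
  grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Poll $1/health once per second, up to $2 attempts (default 30).
# Returns 0 as soon as the service is healthy, 1 if attempts run out.
wait_for_axllm() {
  base_url=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    if curl -fs "$base_url/health" | is_healthy; then
      return 0
    fi
    attempts=$(( attempts - 1 ))
    sleep 1
  done
  return 1
}

# Usage: wait_for_axllm http://127.0.0.1:8000 60 && echo "axllm is ready"
```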

Example model list output:

{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
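
Rather than hard-coding the model id in client scripts, it can be read from the model list. The sketch below extracts the first id with grep and sed; it is a pattern match rather than a real JSON parser, and it assumes a grep that supports -o (GNU, BSD, and BusyBox builds do). The function name is illustrative.

```shell
# Print the first "id" value found in a /v1/models JSON response on stdin.
# A pattern-based sketch, not a full JSON parser.
first_model_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | head -n 1 |
    sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage: MODEL_ID=$(curl -fs http://127.0.0.1:8000/v1/models | first_model_id)
```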

Text Request

By default, this package uses no-thinking mode because the packaged config.json sets enable_thinking=false.

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请只回答城市名。"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "北京"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

Enable Thinking Per Request

To enable explicit reasoning output for a single request, set the top-level field enable_thinking to true in the request body:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请简短思考后给最终答案。"
      }
    ],
    "enable_thinking": true,
    "max_tokens": 384,
    "temperature": 0
  }'

Typical output shape:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "<think>\n...\n</think>\n\n中国的首都是北京。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

The Hugging Face-style request form is also accepted:

{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
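
The fragment above shows only the thinking switch. A complete request body in this form might be assembled as in the sketch below, which assumes the fragment is merged with the same model, messages, max_tokens, and temperature fields used in the earlier requests. The function name is illustrative, and the escaping is deliberately minimal.

```shell
# Print a chat-completions request body for model $1 and user prompt $2,
# enabling thinking through chat_template_kwargs; $3 is max_tokens (default 384).
# Only backslashes and double quotes in the prompt are escaped, so the
# prompt must not contain newlines or other control characters.
thinking_payload() (
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], ' "$1" "$prompt"
  printf '"chat_template_kwargs": {"enable_thinking": true}, '
  printf '"max_tokens": %s, "temperature": 0}\n' "${3:-384}"
)

# Usage (reads the body from stdin with -d @-):
#   thinking_payload "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047" "Why is the sky blue?" |
#     curl http://127.0.0.1:8000/v1/chat/completions \
#       -H 'Content-Type: application/json' -d @-
```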

When thinking mode is enabled, the service returns client-visible <think>...</think> markup so front ends can render reasoning and final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.
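
A client that only needs the final answer can drop the reasoning block on its side. The sketch below works on message content that has already been decoded from JSON (real newlines, not \n escapes) and assumes </think> sits on its own line, as in the typical output shape above. The function name is illustrative.

```shell
# Read message content on stdin and print only the text after the line
# containing </think>, dropping leading blank lines. Content without a
# </think> line is printed unchanged apart from that leading-blank trim.
final_answer() {
  awk '
    { lines[NR] = $0 }
    /<\/think>/ { close_line = NR }
    END {
      start = close_line + 1
      while (start <= NR && lines[start] ~ /^[ \t]*$/) start++
      for (i = start; i <= NR; i++) print lines[i]
    }'
}

printf '<think>\n...\n</think>\n\n中国的首都是北京。\n' | final_answer
```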

Browser UI with `lite_webui`

If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.

Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047.

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Original Hugging Face model: openbmb/MiniCPM5-1B
AXERA conversion and deployment workflow: AXERA-TECH/MiniCPM5-1B.axera

Discussion

GitHub Issues
QQ group: 139953715
