MiniCPM5-1B on AXERA NPU

Ready-to-run deployment package for openbmb/MiniCPM5-1B on AX650 / NPU3.

  • This release packages the AX650 axllm runtime together with the compiled text .axmodel files.
  • The packaged runtime is configured for text-only inference on AX650 / NPU3.
  • The packaged context layout is prefill_len=128, kv_cache_len=2047, and prefill_max_token_num=1280.
  • Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
  • The package includes the tokenizer, runtime config files, and the validated bin/axllm binary for board-side deployment.

Supported Platform

  • AX650 / NPU3

Validated Devices

This package has been validated on the following AX650-based device:

  • AX650 / NPU3 development board

Performance

All measurements below were taken on AX650 / NPU3 with the packaged axllm runtime. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token.

The validated text prompts below were kept within one 128-token prefill chunk. To avoid one-time startup effects, each TTFT row excludes the first request for that prompt pattern.

| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
| --- | --- | --- | --- | --- |
| Text smoke prompt | 24 | 1 x 128 | 160.34 ms avg (159.40-161.28 ms) | n/a (single-token reply) |
| Short front-end prompt | 14 | 1 x 128 | 157.76 ms avg (157.68-157.84 ms) | n/a (short reply) |
| Multi-turn text prompt | 40 | 1 x 128 | 159.89 ms avg (159.19-160.59 ms) | n/a (short reply) |
| Long text generation reference | 30 | 1 x 128 | 159.91 ms avg (159.34-160.49 ms) | 17.96 token/s avg |
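
Each TTFT cell above reports an average and a min-max range, with the first request for each prompt pattern excluded. A minimal sketch of that summary step follows, assuming per-request TTFT samples in milliseconds have already been collected by some means. `summarize_ttft` and `format_ttft` are hypothetical helpers, not part of axllm, and the README does not state how many samples each row contains.

```python
from statistics import mean


def summarize_ttft(samples_ms, warmup=1):
    """Return (avg, min, max) in ms after skipping the first `warmup` samples.

    Hypothetical helper: the skipped samples stand in for the one-time
    startup effects that the table excludes.
    """
    steady = list(samples_ms)[warmup:]
    if not steady:
        raise ValueError("need at least one sample after the warm-up requests")
    return round(mean(steady), 2), min(steady), max(steady)


def format_ttft(samples_ms, warmup=1):
    """Render the summary in the same shape as the table cells."""
    avg, lo, hi = summarize_ttft(samples_ms, warmup)
    return f"{avg:.2f} ms avg ({lo:.2f}-{hi:.2f} ms)"
```

For example, `format_ttft([412.0, 159.40, 161.28])` drops the first value and summarizes the remaining two. Here 412.0 is a made-up warm-up value and the other two reuse the range endpoints of the first table row for illustration only.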

The packaged runtime uses the following context layout:

  • prefill_len=128
  • kv_cache_len=2047
  • prefill_max_token_num=1280
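
The "Prefill chunks" column in the table above can be related to these values with a small calculation. The sketch below assumes that a prompt is split into fixed 128-token chunks by ceiling division and that prompts longer than prefill_max_token_num cannot be prefilled. Both are assumptions about the runtime, not behavior confirmed by this README, and the function names are hypothetical.

```python
import math

PREFILL_LEN = 128             # packaged prefill_len (tokens per prefill chunk)
PREFILL_MAX_TOKEN_NUM = 1280  # packaged prefill_max_token_num


def prefill_chunks(input_tokens, prefill_len=PREFILL_LEN):
    """Number of fixed-size prefill chunks for a prompt (assumed ceiling division)."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return math.ceil(input_tokens / prefill_len)


def fits_prefill_budget(input_tokens, limit=PREFILL_MAX_TOKEN_NUM):
    """Assumption: a prompt must not exceed prefill_max_token_num tokens."""
    return 0 < input_tokens <= limit
```

Under these assumptions, every validated prompt in the table (14 to 40 input tokens) maps to a single chunk, which matches the "1 x 128" entries.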

The Long text generation reference row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS and response-tail overhead become relatively larger.
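
The effect of fixed overhead on short replies can be illustrated with simple arithmetic. The sketch below is a toy model, not a description of how axllm measures decode speed: it assumes a constant steady-state rate and adds a hypothetical fixed overhead (`overhead_s`) to the decode time.

```python
STEADY_DECODE_TOKENS_PER_S = 17.96  # sustained figure from the table above


def observed_decode_rate(n_tokens, overhead_s, steady_rate=STEADY_DECODE_TOKENS_PER_S):
    """Apparent tokens/s when a fixed overhead is added to the decode time.

    overhead_s is a hypothetical stand-in for EOS and response-tail cost.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    decode_time_s = n_tokens / steady_rate
    return n_tokens / (decode_time_s + overhead_s)


# Illustrative only: 0.05 s is an assumed overhead, not a measured value.
for n in (2, 20, 200):
    print(n, round(observed_decode_rate(n, overhead_s=0.05), 2))
```

In this model the apparent rate approaches the steady rate as the reply grows, which is why a long generation gives the more representative figure.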

Startup Runtime Footprint

| Item | Value |
| --- | --- |
| Flash total (24 text axmodels + post axmodel + embedding bin) | 1.42 GiB (1456.71 MiB) |
| Package flash total (excluding .git/) | 1.43 GiB (1464.24 MiB) |
| Runtime CMM requirement | Board-dependent; validate on your target AX650 CMM pool |

On the validated AX650 board, the packaged startup log confirmed max_token_len=2047, prefill_len=128, and prefill_max_token_num=1280. This README does not present one board's remain_cmm(...) value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel

This package uses a flat runtime layout. The packaged axllm binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.

Direct Inference with axllm

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .

Install axllm

Option 1: use the validated binary included in this repository:

chmod +x ./bin/axllm

Option 2: install axllm from the public repository:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 3: install with a one-line command:

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 4: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:

https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm

Then make the binary executable and move it onto your PATH:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Run on the Board

This package already includes a validated bin/axllm binary for AX650.

From the package root on the board:

chmod +x ./bin/axllm
./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047

Health check and model listing:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example health output:

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
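
A script that starts the server and then sends requests may need to wait until the health endpoint responds. The sketch below polls `/health` with the standard library. Reading `concurrency < max_concurrency` as "a request slot is free" is an assumption about the meaning of those fields, and the function names are hypothetical.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://127.0.0.1:8000", timeout=5.0):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def is_idle_and_healthy(health):
    """True when status is healthy and (assumed) a request slot is free."""
    return (
        health.get("status") == "healthy"
        and health.get("concurrency", 0) < health.get("max_concurrency", 1)
    )


def wait_until_ready(fetch=fetch_health, attempts=30, delay_s=1.0):
    """Poll until the service is ready; return False if it never becomes ready."""
    for _ in range(attempts):
        try:
            if is_idle_and_healthy(fetch()):
                return True
        except OSError:
            pass  # server not listening yet (URLError is a subclass of OSError)
        time.sleep(delay_s)
    return False
```

The `fetch` parameter exists so the polling logic can be exercised without a running board, for example by passing a function that returns a fixed dictionary.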

Example model list output:

{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
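
Clients can read the model id from this response instead of hard-coding it. A minimal sketch, assuming the response has the shape shown above; the helper names are hypothetical.

```python
import json
import urllib.request


def list_model_ids(models_response):
    """Extract model ids from a decoded /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def fetch_model_ids(base_url="http://127.0.0.1:8000", timeout=5.0):
    """GET /v1/models and return the advertised model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return list_model_ids(json.load(resp))
```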

Text Request

By default, this package uses no-thinking mode because the packaged config.json sets enable_thinking=false.

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请只回答城市名。"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "北京"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
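
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the fields of the curl example and assumes the response shape shown above. `build_chat_request`, `first_reply`, and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"


def build_chat_request(prompt, max_tokens=32, temperature=0, model=MODEL_ID):
    """Build the JSON body for /v1/chat/completions (same fields as the curl example)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def first_reply(response):
    """Return the assistant text of the first choice."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://127.0.0.1:8000", timeout=60.0, **kwargs):
    """POST one chat request and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return first_reply(json.load(resp))
```

With the server running on the board, `chat("中国的首都是哪里？请只回答城市名。")` sends the prompt from the curl example. The 60-second timeout is an arbitrary choice for this sketch.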

Enable Thinking Per Request

To enable explicit reasoning output for a single request, pass top-level enable_thinking=true:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请简短思考后给最终答案。"
      }
    ],
    "enable_thinking": true,
    "max_tokens": 384,
    "temperature": 0
  }'

Typical output shape:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "<think>\n...\n</think>\n\n中国的首都是北京。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

The Hugging Face-style request form is also accepted:

{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
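
Both request forms can be produced from one base request. The sketch below assumes the `chat_template_kwargs` fragment is merged into the same request body alongside `model` and `messages`; the README shows only the fragment, so that placement is an assumption. `with_thinking` is a hypothetical helper.

```python
import copy


def with_thinking(request, enabled=True, style="top_level"):
    """Return a copy of a chat request with thinking toggled.

    style="top_level" sets enable_thinking directly on the request;
    style="template_kwargs" nests it under chat_template_kwargs.
    """
    updated = copy.deepcopy(request)
    if style == "top_level":
        updated["enable_thinking"] = enabled
    elif style == "template_kwargs":
        updated.setdefault("chat_template_kwargs", {})["enable_thinking"] = enabled
    else:
        raise ValueError(f"unknown style: {style!r}")
    return updated
```

The original request is left unmodified, so one base request can be reused for both thinking and no-thinking calls.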

When thinking mode is enabled, the service returns client-visible <think>...</think> markup so front ends can render reasoning and final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.
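
A front end can separate the two parts with a small parser. This is a minimal sketch, assuming at most one `<think>...</think>` block per reply as in the output shape above; `split_reasoning` is a hypothetical helper.

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Split assistant content into (reasoning, answer).

    reasoning is None when no complete <think>...</think> block is present.
    """
    match = _THINK_BLOCK.search(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer
```

If a reply is cut off by `max_tokens` before the closing tag, this sketch returns the whole text as the answer, so callers may want to handle an unclosed `<think>` separately.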

Browser UI with lite_webui

If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.

Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047.

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Discussion

  • GitHub Issues
  • QQ group: 139953715