---
library_name: transformers
license: apache-2.0
base_model:
  - openbmb/MiniCPM5-1B
pipeline_tag: text-generation
tags:
  - minicpm5
  - llm
  - thinking
  - axera
  - ax650
language:
  - en
  - zh
---

# MiniCPM5-1B on AXERA NPU

Ready-to-run deployment package for `openbmb/MiniCPM5-1B` on AX650 / NPU3.

- This release packages the AX650 `axllm` runtime together with the compiled text `.axmodel` files.
- The packaged runtime is configured for text-only inference on AX650 / NPU3.
- The packaged context layout is `prefill_len=128`, `kv_cache_len=2047`, and `prefill_max_token_num=1280`.
- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
- The package includes the tokenizer, runtime config files, and the validated `bin/axllm` binary for board-side deployment.

## Supported Platform

- [x] AX650 / NPU3

## Validated Devices

This package has been validated on the following AX650-based device:

- AX650 / NPU3 development board

## Performance

All measurements below were taken on AX650 / NPU3 with the packaged `axllm` runtime. `TTFT` (time to first token) is measured end-to-end, from request arrival at `axllm serve` to the first generated token.

The validated text prompts below were kept within one `128`-token prefill chunk. To avoid one-time startup effects, each `TTFT` row excludes the first request for that prompt pattern.

| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---:|---:|---:|---:|
| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |

The packaged runtime uses the following context layout:

- `prefill_len=128`
- `kv_cache_len=2047`
- `prefill_max_token_num=1280`
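
As a rough illustration of how these three values relate to a request, the sketch below counts the `128`-token prefill chunks a prompt would need and checks whether a prompt plus reply budget fits. The fit rule (prompt within `prefill_max_token_num`, prompt plus `max_tokens` within `kv_cache_len`) is an assumption made for illustration, not a documented `axllm` check, and the function names are hypothetical.

```python
import math

# Values taken from the packaged context layout in this README.
PREFILL_LEN = 128             # tokens per prefill chunk
KV_CACHE_LEN = 2047           # total context length
PREFILL_MAX_TOKEN_NUM = 1280  # packaged prefill limit


def prefill_chunks(input_tokens: int) -> int:
    """Number of PREFILL_LEN-sized chunks needed to prefill a prompt."""
    return max(1, math.ceil(input_tokens / PREFILL_LEN))


def fits(input_tokens: int, max_tokens: int) -> bool:
    """Assumed rule: prompt within the prefill limit, prompt + reply within the KV cache."""
    return (
        input_tokens <= PREFILL_MAX_TOKEN_NUM
        and input_tokens + max_tokens <= KV_CACHE_LEN
    )


# The validated prompts (14 to 40 tokens) all land in a single chunk.
for n in (14, 24, 30, 40, 300):
    print(n, prefill_chunks(n), fits(n, 384))
```

Token counts come from the packaged tokenizer, so a client that wants to apply a check like this needs its own way to count prompt tokens.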

The `Long text generation reference` row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS and response-tail overhead become relatively larger.
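
For capacity planning, the figures from the `Long text generation reference` row can be turned into a back-of-envelope latency estimate. This is only a sketch: it assumes a prompt that fits in one prefill chunk and a constant decode rate, and `estimate_seconds` is a hypothetical helper, not part of `axllm`.

```python
# Figures copied from the "Long text generation reference" row.
TTFT_S = 0.15991          # average time to first token, in seconds
DECODE_TOK_PER_S = 17.96  # average sustained decode rate


def estimate_seconds(output_tokens: int) -> float:
    """Rough end-to-end time: first token at TTFT, the rest at the decode rate."""
    remaining = max(output_tokens - 1, 0)
    return TTFT_S + remaining / DECODE_TOK_PER_S


for n in (32, 256, 384):
    print(f"{n} tokens: about {estimate_seconds(n):.1f} s")
```

Real requests will deviate from this, especially short replies, for the reason given above.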

## Startup Runtime Footprint

| Item | Value |
|---|---:|
| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |

On the validated AX650 board, the packaged startup log confirmed `max_token_len=2047`, `prefill_len=128`, and `prefill_max_token_num=1280`. This README does not present one board's `remain_cmm(...)` value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.
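
To compare a downloaded copy against the flash total in the first table row, one option is to sum the sizes of the `.axmodel` files and the embedding file. The sketch below assumes the flat layout of this package and that every `*.axmodel` file in the root belongs to the model; `flash_total_mib` is a hypothetical helper name.

```python
from pathlib import Path

EMBEDDING_FILE = "model.embed_tokens.weight.bfloat16.bin"


def flash_total_mib(root: str = ".") -> float:
    """Sum the sizes of all root-level .axmodel files plus the embedding bin, in MiB."""
    base = Path(root)
    files = list(base.glob("*.axmodel"))
    files.append(base / EMBEDDING_FILE)
    total_bytes = sum(f.stat().st_size for f in files if f.is_file())
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Run from the package root; compare against the table value above.
    print(f"{flash_total_mib('.'):.2f} MiB")
```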

## Package Layout

```text
.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel
```

This package uses a flat runtime layout. The packaged `axllm` binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.
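
Because the runtime reads these files straight from the package root, an incomplete download is worth catching before starting the server. The following sketch lists missing files; it assumes the elided layer files follow the `llama_p128_l{i}_together.axmodel` pattern for `i` from `0` to `23`, which matches the 24 text axmodels mentioned above but is not spelled out file by file in this README.

```python
from pathlib import Path

# Root-level files named in the package layout, plus the assumed layer pattern.
EXPECTED = [
    "bin/axllm",
    "config.json",
    "post_config.json",
    "minicpm5_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "llama_post.axmodel",
] + [f"llama_p128_l{i}_together.axmodel" for i in range(24)]


def missing_files(root: str = ".") -> list:
    """Return the expected package files that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("package looks complete" if not missing else f"missing: {missing}")
```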

## Direct Inference with `axllm`

### Download the Model Package

Download the release package from Hugging Face:

```shell
mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .
```

### Install `axllm`

Option 1: use the validated binary included in this repository:

```bash
chmod +x ./bin/axllm
```

Option 2: install `axllm` from the public repository:

```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```

Option 3: install with a one-line command:

```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```

Option 4: download a prebuilt binary from GitHub Actions CI.

If you do not have a local build environment, download the latest CI-built `axllm` binary from the workflow runs at `https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`, then make it executable and move it onto your `PATH`:

```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```

### Run on the Board

This package already includes a validated `bin/axllm` binary for AX650.

From the package root on the board:

```bash
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
```

Expected model id:

```text
AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
```

Health check and model listing:

```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```

Example health output:

```json
{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
```

Example model list output:

```json
{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
```
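
The same two checks can be scripted, for example to wait for the server before sending requests. This is a minimal sketch using only the Python standard library; it relies on the `/health` and `/v1/models` response shapes shown above, and the helper names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same address as the curl examples


def get_json(path: str, timeout: float = 5.0) -> dict:
    """GET a JSON document from the axllm server."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_healthy(health: dict) -> bool:
    """True when the /health payload reports status 'healthy'."""
    return health.get("status") == "healthy"


def first_model_id(models: dict) -> str:
    """Extract the id of the first entry in a /v1/models payload."""
    return models["data"][0]["id"]


def wait_until_healthy(retries: int = 30, delay: float = 1.0) -> bool:
    """Poll /health until it reports healthy or the retries run out."""
    for _ in range(retries):
        try:
            if is_healthy(get_json("/health")):
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False
```

Once `wait_until_healthy()` returns `True`, `first_model_id(get_json("/v1/models"))` yields the model id to use in chat requests.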

### Text Request

Thinking is off by default because the packaged `config.json` sets `enable_thinking=false`, so a plain request returns only the final answer.

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请只回答城市名。"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```

Example output:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "北京"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```
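
For clients that prefer Python over `curl`, the request above can be sent with the standard library alone. The sketch mirrors the fields of the curl example and reads the reply from the response shape shown; `build_payload`, `chat`, and `reply_text` are hypothetical helper names, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"


def build_payload(user_text: str, max_tokens: int = 32) -> dict:
    """Build a chat completion request body with the same fields as the curl example."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def chat(payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON response."""
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def reply_text(response: dict) -> str:
    """Extract the assistant text from a chat completion response."""
    return response["choices"][0]["message"]["content"]
```

With the server running, `reply_text(chat(build_payload("中国的首都是哪里？请只回答城市名。")))` performs the same request as the curl command.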

### Enable Thinking Per Request

To enable explicit reasoning output for a single request, add the top-level field `"enable_thinking": true` to the request body:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里？请简短思考后给最终答案。"
      }
    ],
    "enable_thinking": true,
    "max_tokens": 384,
    "temperature": 0
  }'
```

Typical output shape:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "<think>\n...\n</think>\n\n中国的首都是北京。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```

The Hugging Face-style form is also accepted. Instead of the top-level field, nest the flag under `chat_template_kwargs` in the request body:

```json
{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
```

When thinking mode is enabled, the service returns client-visible `<think>...</think>` markup so front ends can render the reasoning and the final answer separately. Follow-up turns keep the official MiniCPM5 template behavior: previous assistant reasoning is not reinserted into the prompt for the next turn.
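
A front end can separate the two parts with a small amount of string handling. The sketch below is one possible approach, assuming at most one `<think>...</think>` block per reply as in the output shape above; the handling of a reply cut off before `</think>` (for example by a low `max_tokens`) is an illustrative choice, and `split_thinking` is a hypothetical helper name.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple:
    """Split a reply into (reasoning, answer); reasoning is '' when no markup is present."""
    match = THINK_RE.search(content)
    if match:
        reasoning = match.group(1).strip()
        answer = (content[:match.start()] + content[match.end():]).strip()
        return reasoning, answer
    if "<think>" in content:
        # Reply was truncated before </think>: treat the tail as partial reasoning.
        before, _, partial = content.partition("<think>")
        return partial.strip(), before.strip()
    return "", content.strip()


reasoning, answer = split_thinking("<think>\n...\n</think>\n\n中国的首都是北京。")
print(repr(reasoning))
print(repr(answer))
```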

## Browser UI with `lite_webui`

If you want a browser UI for the OpenAI-compatible service started by `axllm serve`, use [AXERA-TECH/lite_webui](https://huggingface.co/AXERA-TECH/lite_webui/tree/main).

Set the OpenAI base URL to `http://<board-ip>:8000` and the model name to `AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047`.

## Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

- Original Hugging Face model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- AXERA conversion and deployment workflow: [AXERA-TECH/MiniCPM5-1B.axera](https://github.com/AXERA-TECH/MiniCPM5-1B.axera)

## Discussion

- GitHub Issues
- QQ group: `139953715`