---
library_name: transformers
license: apache-2.0
base_model:
- openbmb/MiniCPM5-1B
pipeline_tag: text-generation
tags:
- minicpm5
- llm
- thinking
- axera
- ax650
language:
- en
- zh
---

# MiniCPM5-1B on AXERA NPU

Ready-to-run deployment package for `openbmb/MiniCPM5-1B` on AX650 / NPU3.

- This release packages the AX650 `axllm` runtime together with the compiled text `.axmodel` files.
- The packaged runtime is configured for text-only inference on AX650 / NPU3.
- The packaged context layout is `prefill_len=128`, `kv_cache_len=2047`, and `prefill_max_token_num=1280`.
- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
- The package includes the tokenizer, runtime config files, and the validated `bin/axllm` binary for board-side deployment.

## Supported Platform

- [x] AX650 / NPU3

## Validated Devices

This package has been validated on the following AX650-based device:

- AX650 / NPU3 development board

## Performance

All measurements below were taken on AX650 / NPU3 with the packaged `axllm` runtime.

`TTFT` stands for time to first token. In this table, `TTFT` is measured end-to-end from request arrival at `axllm serve` to the first generated token. The validated text prompts below were kept within one `128`-token prefill chunk. To avoid one-time startup effects, each `TTFT` row excludes the first request for that prompt pattern.
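The one-chunk condition mentioned above can be checked with a little arithmetic. The sketch below is illustrative only: `prefill_chunks` is a hypothetical helper, not part of `axllm`, and it assumes the runtime splits a prompt into fixed chunks of `prefill_len` tokens, so the chunk count is the token count divided by `128`, rounded up.

```shell
#!/bin/sh
# Sketch only: estimate how many prefill chunks a prompt needs.
# Assumption: prompts are split into fixed chunks of PREFILL_LEN tokens,
# so the count is ceil(tokens / PREFILL_LEN). Not taken from axllm itself.

PREFILL_LEN=128  # prefill_len from the packaged context layout

# prefill_chunks TOKENS: print the estimated number of prefill chunks.
prefill_chunks() {
  # Integer ceiling division: (n + d - 1) / d
  echo $(( ($1 + PREFILL_LEN - 1) / PREFILL_LEN ))
}

prefill_chunks 24   # prints 1: a 24-token prompt fits in one chunk
prefill_chunks 129  # prints 2: one token over the chunk size needs a second chunk
```

Under this assumption, any prompt of up to `128` tokens stays within a single prefill chunk.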

| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---:|---:|---:|---:|
| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |

The packaged runtime uses the following context layout:

- `prefill_len=128`
- `kv_cache_len=2047`
- `prefill_max_token_num=1280`

The `Long text generation reference` row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS and response-tail overhead become relatively larger.

## Startup Runtime Footprint

| Item | Value |
|---|---:|
| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |

On the validated AX650 board, the packaged startup log confirmed `max_token_len=2047`, `prefill_len=128`, and `prefill_max_token_num=1280`.

This README does not present one board's `remain_cmm(...)` value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.

## Package Layout

```text
.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel
```

This package uses a flat runtime layout.
The packaged `axllm` binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.

## Direct Inference with `axllm`

### Download the Model Package

Download the release package from Hugging Face:

```shell
mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .
```

### Install `axllm`

Option 1: use the validated binary included in this repository:

```bash
chmod +x ./bin/axllm
```

Option 2: install `axllm` from the public repository:

```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```

Option 3: install with a one-line command:

```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```

Option 4: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated `axllm` binary from GitHub Actions:

`https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`

Then run:

```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```

### Run on the Board

This package already includes a validated `bin/axllm` binary for AX650. From the package root on the board:

```bash
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
```

Expected model id:

```text
AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
```

Health check and model listing:

```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```

Example health output:

```json
{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
```

Example model list output:

```json
{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
```

### Text Request

By default, this package uses no-thinking mode because the packaged `config.json` sets `enable_thinking=false`.

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里?请只回答城市名。"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```

Example output:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "北京"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```

### Enable Thinking Per Request

To enable explicit reasoning output for a single request, pass top-level `enable_thinking=true`:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里?请简短思考后给最终答案。"
      }
    ],
    "enable_thinking": true,
    "max_tokens": 384,
    "temperature": 0
  }'
```

Typical output shape:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "\n...\n\n\n中国的首都是北京。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```

The Hugging Face-style request form is also accepted:

```json
{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
```

When thinking mode is enabled, the service returns client-visible `...` markup so front ends can render reasoning and final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.

## Browser UI with `lite_webui`

If you want a browser UI for the OpenAI-compatible service started by `axllm serve`, use [AXERA-TECH/lite_webui](https://huggingface.co/AXERA-TECH/lite_webui/tree/main).

Set the OpenAI base URL to `http://BOARD_IP:8000` (replace `BOARD_IP` with the board's address) and the model name to `AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047`.
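
For scripted access instead of a browser UI, the request bodies shown in the text and thinking examples can be generated by a small shell helper. This is a minimal sketch, not part of `axllm`: `json_escape` and `build_chat_body` are hypothetical names, and the escaping assumes the prompt contains no newlines or other control characters. The `enable_thinking` field is added only when requested, so the default request matches the plain text example.

```shell
#!/bin/sh
# Sketch only: build a /v1/chat/completions request body for this package.
# json_escape and build_chat_body are hypothetical helpers, not axllm commands.

MODEL_ID="AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"

# json_escape TEXT: escape backslashes and double quotes for a JSON string.
# Assumption: TEXT has no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_chat_body PROMPT [THINKING] [MAX_TOKENS]
# THINKING is the literal word true or false (default: false).
build_chat_body() {
  thinking_field=""
  if [ "${2:-false}" = "true" ]; then
    # Top-level enable_thinking, as in the per-request example.
    thinking_field='"enable_thinking":true,'
  fi
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],%s"max_tokens":%s,"temperature":0}\n' \
    "$MODEL_ID" "$(json_escape "$1")" "$thinking_field" "${3:-32}"
}

# Example (requires a running `axllm serve` on port 8000):
#   build_chat_body 'Hello' true 384 |
#     curl http://127.0.0.1:8000/v1/chat/completions \
#       -H 'Content-Type: application/json' -d @-
```

Keeping body construction separate from the `curl` call makes it possible to inspect the JSON before it is sent to the board.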

## Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

- Original Hugging Face model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- AXERA conversion and deployment workflow: [AXERA-TECH/MiniCPM5-1B.axera](https://github.com/AXERA-TECH/MiniCPM5-1B.axera)

## Discussion

- GitHub Issues
- QQ group: `139953715`