---
library_name: transformers
license: apache-2.0
base_model:
- openbmb/MiniCPM5-1B
pipeline_tag: text-generation
tags:
- minicpm5
- llm
- thinking
- axera
- ax650
language:
- en
- zh
---

# MiniCPM5-1B on AXERA NPU

Ready-to-run deployment package for `openbmb/MiniCPM5-1B` on AX650 / NPU3.

- This release packages the AX650 `axllm` runtime together with the compiled text `.axmodel` files.
- The packaged runtime is configured for text-only inference on AX650 / NPU3.
- The packaged context layout is `prefill_len=128`, `kv_cache_len=2047`, and `prefill_max_token_num=1280`.
- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
- The package includes the tokenizer, runtime config files, and the validated `bin/axllm` binary for board-side deployment.

## Supported Platform

- [x] AX650 / NPU3

## Validated Devices

This package has been validated on the following AX650-based device:

- AX650 / NPU3 development board

## Performance

All measurements below were taken on AX650 / NPU3 with the packaged `axllm` runtime. `TTFT` stands for time to first token and is measured end-to-end, from request arrival at `axllm serve` to the first generated token.

Each validated text prompt below fits within one `128`-token prefill chunk. To avoid one-time startup effects, each `TTFT` row excludes the first request for that prompt pattern.

| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---:|---:|---:|---:|
| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |

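
The warm-up exclusion used for the `TTFT` column can be reproduced when summarizing your own timings. The following is a minimal sketch, not the script used for this table: `summarize_ttft` is a hypothetical helper, and the sample values are placeholders rather than measurements from this package.

```python
from statistics import mean


def summarize_ttft(samples_ms, warmup=1):
    """Return (avg, min, max) in ms after dropping the first `warmup` samples."""
    steady = list(samples_ms)[warmup:]
    if not steady:
        raise ValueError("need at least one sample after the warm-up requests")
    return mean(steady), min(steady), max(steady)


# Placeholder timings for illustration only, not measured on AX650.
avg, lo, hi = summarize_ttft([900.0, 160.0, 162.0])
print(f"{avg:.2f} ms avg ({lo:.2f}-{hi:.2f} ms)")
```

Dropping the first request per prompt pattern keeps one-time startup cost out of the reported average, which matches the method described for the table.
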
The packaged runtime uses the following context layout:

- `prefill_len=128`
- `kv_cache_len=2047`
- `prefill_max_token_num=1280`

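
To see how a prompt length maps onto this layout, the arithmetic can be sketched as below. This is an illustration under two assumptions that this README does not confirm: that the runtime splits a prompt into `ceil(input_tokens / prefill_len)` chunks, and that prompts longer than `prefill_max_token_num` are not accepted. `prefill_chunks` is a hypothetical helper, not part of `axllm`.

```python
import math

PREFILL_LEN = 128             # tokens per prefill chunk (from this README)
PREFILL_MAX_TOKEN_NUM = 1280  # prefill token budget (from this README)


def prefill_chunks(input_tokens):
    """Estimate how many 128-token prefill chunks a prompt occupies.

    Assumption: chunks = ceil(input_tokens / PREFILL_LEN), and prompts longer
    than PREFILL_MAX_TOKEN_NUM are rejected. Neither rule is confirmed here.
    """
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    if input_tokens > PREFILL_MAX_TOKEN_NUM:
        raise ValueError("prompt exceeds the packaged prefill budget")
    return math.ceil(input_tokens / PREFILL_LEN)


# Input sizes from the performance table, plus one that needs a second chunk.
for n in (14, 24, 30, 40, 129):
    print(n, prefill_chunks(n))
```

Under these assumptions, every prompt in the performance table occupies a single chunk, which agrees with its `1 x 128` column.
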
The `Long text generation reference` row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS and response-tail overhead make up a larger share of the total time.

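
For rough capacity planning, the reference row can be turned into a wall-clock estimate. The sketch below assumes a simple model that this README does not state: the first token arrives at `TTFT`, and each later token arrives at the steady decode rate. `estimate_latency_s` is a hypothetical helper, and real latency will vary with prompt and board state.

```python
def estimate_latency_s(output_tokens, ttft_ms=159.91, decode_tps=17.96):
    """Rough wall-clock estimate: TTFT plus the remaining tokens at decode speed.

    Defaults come from the `Long text generation reference` row. Assumption:
    the first token arrives at TTFT and every later token at a steady rate.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms / 1000.0 + (output_tokens - 1) / decode_tps


print(f"{estimate_latency_s(384):.1f} s for a 384-token reply")
```
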
## Startup Runtime Footprint

| Item | Value |
|---|---:|
| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |

On the validated AX650 board, the packaged startup log confirmed `max_token_len=2047`, `prefill_len=128`, and `prefill_max_token_num=1280`. This README does not present one board's `remain_cmm(...)` value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.

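
If you want to check the flash figure against a downloaded copy, a sketch like the following sums the same file classes named in the first table row. `flash_total_mib` is a hypothetical helper, and the glob patterns are assumptions based on the file names listed in this README.

```python
from pathlib import Path


def flash_total_mib(package_root="."):
    """Sum the sizes of *.axmodel files plus the embedding .bin, in MiB."""
    root = Path(package_root)
    files = list(root.glob("*.axmodel"))
    # Assumption: the embedding weights match this name pattern.
    files += list(root.glob("model.embed_tokens.weight.*.bin"))
    return sum(f.stat().st_size for f in files) / (1024 * 1024)


# Run from the package root after downloading.
print(f"{flash_total_mib('.'):.2f} MiB")
```
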
## Package Layout

```text
.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── minicpm5_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l23_together.axmodel
└── llama_post.axmodel
```

This package uses a flat runtime layout. The packaged `axllm` binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.

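
Because the runtime reads everything from the package root, an incomplete download is worth catching before starting the server. The sketch below lists files that appear to be missing. Both function names are hypothetical, and the per-layer file names are an assumption: the layout above elides the entries between `l0` and `l23`, so the code guesses that they follow the same naming pattern.

```python
from pathlib import Path


def expected_runtime_files(num_layers=24):
    """File names taken from the package layout listed in this README."""
    names = [
        "bin/axllm",
        "config.json",
        "post_config.json",
        "minicpm5_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "llama_post.axmodel",
    ]
    # Assumption: the elided entries follow the l0..l23 naming pattern.
    names += [f"llama_p128_l{i}_together.axmodel" for i in range(num_layers)]
    return names


def missing_runtime_files(package_root="."):
    """Return the expected files that are absent under package_root."""
    root = Path(package_root)
    return [n for n in expected_runtime_files() if not (root / n).is_file()]


# Run from the package root; an empty list means nothing expected is missing.
print(missing_runtime_files("."))
```
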
## Direct Inference with `axllm`

### Download the Model Package

Download the release package from Hugging Face:

```shell
mkdir -p AXERA-TECH/MiniCPM5-1B
cd AXERA-TECH/MiniCPM5-1B
hf download AXERA-TECH/MiniCPM5-1B --local-dir .
```

### Install `axllm`

Option 1: use the validated binary included in this repository:

```bash
chmod +x ./bin/axllm
```

Option 2: install `axllm` from the public repository:

```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```

Option 3: install with a one-line command:

```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```

Option 4: download a prebuilt binary from GitHub Actions CI. If you do not have a local build environment, download the latest CI-generated `axllm` binary from:

`https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`

Then make it executable and move it onto your `PATH`:

```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```

### Run on the Board

This package already includes a validated `bin/axllm` binary for AX650.

From the package root on the board:

```bash
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
```

Expected model id:

```text
AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
```

Health check and model listing:

```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```

Example health output:

```json
{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
```

Example model list output:

```json
{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
```

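
The same two checks can be scripted with the Python standard library. This is a sketch rather than an official client: `get_json`, `is_healthy`, and `first_model_id` are hypothetical helpers, and the payload shapes are assumed to match the example outputs shown in this section.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # same address as the curl examples


def get_json(path, base_url=BASE_URL, timeout=10):
    """GET a JSON document from the running `axllm serve` instance."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def is_healthy(health):
    """True when the /health payload reports status == "healthy"."""
    return health.get("status") == "healthy"


def first_model_id(models):
    """Return the first id from a /v1/models payload."""
    return models["data"][0]["id"]


# Usage against a running server:
#   print(is_healthy(get_json("/health")))
#   print(first_model_id(get_json("/v1/models")))
```

Reading the model id from `/v1/models` avoids hard-coding it in client scripts.
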
### Text Request

By default, this package uses no-thinking mode because the packaged `config.json` sets `enable_thinking=false`. The example prompt below is in Chinese and asks for the capital of China, with the city name as the only answer; the expected reply is "北京" (Beijing).

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里?请只回答城市名。"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
```

Example output:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "北京"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```

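
For scripted use, the curl request above maps directly onto the Python standard library. The sketch below is one possible client, not part of this package: `build_chat_request`, `reply_text`, and `chat` are hypothetical helpers, and the response shape is assumed to match the example output.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"


def build_chat_request(prompt, max_tokens=32, temperature=0,
                       base_url="http://127.0.0.1:8000"):
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response):
    """Extract the assistant text from a chat.completion response dict."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the request and return the assistant reply (needs a live server)."""
    with urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        return reply_text(json.load(resp))


# Usage against a running server:
#   print(chat("中国的首都是哪里?请只回答城市名。"))
```
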
### Enable Thinking Per Request

To enable explicit reasoning output for a single request, set the top-level `enable_thinking` field to `true` in the request body. The Chinese prompt below asks for the capital of China and requests brief reasoning before the final answer:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": "中国的首都是哪里?请简短思考后给最终答案。"
      }
    ],
    "enable_thinking": true,
    "max_tokens": 384,
    "temperature": 0
  }'
```

Typical output shape:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "<think>\n...\n</think>\n\n中国的首都是北京。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
```

The Hugging Face-style request form, which nests the flag under `chat_template_kwargs`, is also accepted:

```json
{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
```

When thinking mode is enabled, the service returns client-visible `<think>...</think>` markup so front ends can render the reasoning and the final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.

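
A front end can separate the two parts with a small parser. The sketch below assumes the reply contains at most one `<think>...</think>` block, as in the typical output shape above. `split_thinking` and `thinking_payload` are hypothetical helpers, not part of `axllm` or `lite_webui`.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Split assistant content into (reasoning, final_answer)."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


def thinking_payload(messages, model, hf_style=False, max_tokens=384):
    """Build a request body using either of the two documented forms."""
    body = {"model": model, "messages": messages,
            "max_tokens": max_tokens, "temperature": 0}
    if hf_style:
        body["chat_template_kwargs"] = {"enable_thinking": True}
    else:
        body["enable_thinking"] = True
    return body


reasoning, answer = split_thinking("<think>\n...\n</think>\n\n中国的首都是北京。")
print(answer)
```

A client that keeps its own chat history may choose to store only the `answer` part for later turns. That is a client-side design choice; the server-side template behavior is as described above.
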
## Browser UI with `lite_webui`

If you want a browser UI for the OpenAI-compatible service started by `axllm serve`, use [AXERA-TECH/lite_webui](https://huggingface.co/AXERA-TECH/lite_webui/tree/main).

Set the OpenAI base URL to `http://<board-ip>:8000` and the model name to `AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047`.

## Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

- Original Hugging Face model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- AXERA conversion and deployment workflow: [AXERA-TECH/MiniCPM5-1B.axera](https://github.com/AXERA-TECH/MiniCPM5-1B.axera)

## Discussion

- GitHub Issues
- QQ group: `139953715`