	---
	library_name: transformers
	license: apache-2.0
	base_model:
	- openbmb/MiniCPM5-1B
	pipeline_tag: text-generation
	tags:
	- minicpm5
	- llm
	- thinking
	- axera
	- ax650
	language:
	- en
	- zh
	---

	# MiniCPM5-1B on AXERA NPU

	Ready-to-run deployment package for `openbmb/MiniCPM5-1B` on AX650 / NPU3.

	- This release packages the AX650 `axllm` runtime together with the compiled text `.axmodel` files.
	- The packaged runtime is configured for text-only inference on AX650 / NPU3.
	- The packaged context layout is `prefill_len=128`, `kv_cache_len=2047`, and `prefill_max_token_num=1280`.
	- Thinking is disabled by default and can be enabled per request through the public OpenAI-compatible API.
	- The package includes the tokenizer, runtime config files, and the validated `bin/axllm` binary for board-side deployment.

	## Supported Platform

	- [x] AX650 / NPU3

	## Validated Devices

	This package has been validated on the following AX650-based device:

	- AX650 / NPU3 development board

	## Performance

	All measurements below were taken on AX650 / NPU3 with the packaged `axllm` runtime. `TTFT` stands for time to first token. In this table, `TTFT` is measured end-to-end from request arrival at `axllm serve` to the first generated token.

	The validated text prompts below were kept within one `128`-token prefill chunk. To avoid one-time startup effects, each `TTFT` row excludes the first request for that prompt pattern.

	| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
	|---|---:|---:|---:|---:|
	| Text smoke prompt | `24` | `1 x 128` | `160.34 ms avg` (`159.40-161.28 ms`) | `n/a (single-token reply)` |
	| Short front-end prompt | `14` | `1 x 128` | `157.76 ms avg` (`157.68-157.84 ms`) | `n/a (short reply)` |
	| Multi-turn text prompt | `40` | `1 x 128` | `159.89 ms avg` (`159.19-160.59 ms`) | `n/a (short reply)` |
	| Long text generation reference | `30` | `1 x 128` | `159.91 ms avg` (`159.34-160.49 ms`) | `17.96 token/s avg` |
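If you want to reproduce the `TTFT` columns on your own board, the averaging rule described above (drop the first request, then report the average and range) can be sketched as follows. This is a minimal illustration, not the script used for the table; `summarize_ttft` and its input list are hypothetical names, and collecting the per-request timings is left to your own client.

```python
from statistics import mean


def summarize_ttft(samples_ms):
    """Summarize TTFT samples (in ms) for one prompt pattern.

    The first sample is treated as a warm-up request and excluded,
    mirroring the rule stated for the table above.
    """
    if len(samples_ms) < 2:
        raise ValueError("need one warm-up request plus at least one measured request")
    measured = samples_ms[1:]  # skip the first request (one-time startup effects)
    return {
        "avg_ms": round(mean(measured), 2),
        "min_ms": min(measured),
        "max_ms": max(measured),
    }
```

Pass the timings in request order so that the warm-up request is the one that gets dropped.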

	The packaged runtime uses the following context layout:

	- `prefill_len=128`
	- `kv_cache_len=2047`
	- `prefill_max_token_num=1280`
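As a rough planning aid, the layout values above can be turned into a quick check of how many prefill chunks a prompt needs and whether it fits. The sketch below assumes that `prefill_max_token_num` bounds the prompt length and that `kv_cache_len` bounds prompt plus generated tokens; that reading, and the helper name `plan_prompt`, are assumptions for illustration rather than documented `axllm` behavior.

```python
import math

PREFILL_LEN = 128             # tokens per prefill chunk (from this README)
KV_CACHE_LEN = 2047           # context slots (from this README)
PREFILL_MAX_TOKEN_NUM = 1280  # prefill limit (from this README)


def plan_prompt(input_tokens, max_new_tokens=0):
    """Estimate chunk count and fit for a prompt (assumed interpretation)."""
    if input_tokens < 1:
        raise ValueError("input_tokens must be positive")
    return {
        # Each chunk covers up to PREFILL_LEN tokens, so round up.
        "prefill_chunks": math.ceil(input_tokens / PREFILL_LEN),
        # Assumption: prompts longer than the prefill limit are not supported.
        "fits_prefill": input_tokens <= PREFILL_MAX_TOKEN_NUM,
        # Assumption: prompt and generated tokens share the KV cache.
        "fits_context": input_tokens + max_new_tokens <= KV_CACHE_LEN,
    }
```

Under these assumptions, every prompt in the performance table (14 to 40 input tokens) maps to a single chunk, which matches the `1 x 128` column.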

	The `Long text generation reference` row is the recommended sustained text-only decode figure for this package. Very short replies under-report decode speed because EOS and response-tail overhead become relatively larger.
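The under-reporting effect can be illustrated with a simple model in which every response pays a fixed overhead on top of steady-state decoding. The overhead value is hypothetical and only serves to show the trend; the steady rate is the `17.96 token/s` figure from the table.

```python
def observed_tokens_per_s(num_tokens, steady_rate, fixed_overhead_s):
    """Client-visible throughput when a fixed per-response overhead is added.

    steady_rate is the sustained decode speed in tokens per second;
    fixed_overhead_s is a hypothetical EOS / response-tail cost in seconds.
    """
    decode_time_s = num_tokens / steady_rate
    return num_tokens / (decode_time_s + fixed_overhead_s)


# Hypothetical 50 ms overhead: short replies fall well below the steady rate,
# long replies approach it.
short_reply = observed_tokens_per_s(2, 17.96, 0.05)
long_reply = observed_tokens_per_s(500, 17.96, 0.05)
```

This is why the decode column reports `n/a` for the short-reply rows rather than a misleadingly low number.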

	## Startup Runtime Footprint

	| Item | Value |
	|---|---:|
	| `Flash total (24 text axmodels + post axmodel + embedding bin)` | `1.42 GiB` (`1456.71 MiB`) |
	| `Package flash total (excluding .git/)` | `1.43 GiB` (`1464.24 MiB`) |
	| `Runtime CMM requirement` | `Board-dependent; validate on your target AX650 CMM pool` |

	On the validated AX650 board, the packaged startup log confirmed `max_token_len=2047`, `prefill_len=128`, and `prefill_max_token_num=1280`. This README does not present one board's `remain_cmm(...)` value as a package-wide memory requirement, because the absolute remaining CMM pool depends on the board's global memory layout.
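To compare a downloaded copy against the flash figures in the table, a small script can total the file sizes. This is a sketch under the assumption that the model payload is every `*.axmodel` and `*.bin` file in the package root; `flash_usage_mib` and `mib_to_gib` are illustrative helper names, not part of `axllm`.

```python
from pathlib import Path

MIB = 1024 ** 2


def flash_usage_mib(root):
    """Return (model payload MiB, whole package MiB excluding .git/)."""
    root = Path(root)
    # Model payload: axmodel files plus the embedding .bin in the package root.
    model_bytes = sum(
        p.stat().st_size
        for p in root.iterdir()
        if p.is_file() and p.suffix in {".axmodel", ".bin"}
    )
    # Whole package: every regular file that is not under .git/.
    package_bytes = sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )
    return round(model_bytes / MIB, 2), round(package_bytes / MIB, 2)


def mib_to_gib(mib):
    """Convert MiB to GiB, rounded to two decimals as in the table."""
    return round(mib / 1024, 2)
```

Run `flash_usage_mib(".")` from the package root. Small differences from the table can come from filesystem or download tooling, so treat a mismatch as a prompt to re-check the download rather than as proof of corruption.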

	## Package Layout

	```text
	.
	├── README.md
	├── bin/
	│   ├── axllm
	│   └── axllm.version.json
	├── config.json
	├── post_config.json
	├── minicpm5_tokenizer.txt
	├── model.embed_tokens.weight.bfloat16.bin
	├── llama_p128_l0_together.axmodel
	├── ...
	├── llama_p128_l23_together.axmodel
	└── llama_post.axmodel
	```

	This package uses a flat runtime layout. The packaged `axllm` binary reads the root-level runtime files directly, so no extra path arguments are required when you serve the repository root.
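Because the runtime reads these files straight from the package root, a quick pre-flight check for missing files can save a failed start on the board. The sketch below assumes the elided layer files follow the `llama_p128_l<N>_together.axmodel` pattern for `N` from 0 to 23, which is inferred from the tree and from the "24 text axmodels" figure in the footprint table; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

NUM_LAYERS = 24  # llama_p128_l0 ... llama_p128_l23 (assumed contiguous)


def expected_files():
    """List the files the package layout above is expected to contain."""
    names = [
        "bin/axllm",
        "config.json",
        "post_config.json",
        "minicpm5_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "llama_post.axmodel",
    ]
    names += [f"llama_p128_l{i}_together.axmodel" for i in range(NUM_LAYERS)]
    return names


def missing_files(root):
    """Return the expected files that are absent under root."""
    root = Path(root)
    return [name for name in expected_files() if not (root / name).is_file()]
```

An empty list from `missing_files(".")` means the layout matches this sketch; it does not validate file contents.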

	## Direct Inference with `axllm`

	### Download the Model Package

	Download the release package from Hugging Face:

	```shell
	mkdir -p AXERA-TECH/MiniCPM5-1B
	cd AXERA-TECH/MiniCPM5-1B
	hf download AXERA-TECH/MiniCPM5-1B --local-dir .
	```

	### Install `axllm`

	Option 1: use the validated binary included in this repository:

	```bash
	chmod +x ./bin/axllm
	```

	Option 2: install `axllm` from the public repository:

	```shell
	git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
	cd ax-llm
	./install.sh
	```

	Option 3: install with a one-line command:

	```shell
	curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
	```

	Option 4: download the prebuilt binary from GitHub Actions CI:

	If you do not have a local build environment, download the latest CI-generated `axllm` binary from GitHub Actions:
	`https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`
	Then run:

	```shell
	chmod +x axllm
	sudo mv axllm /usr/bin/axllm
	```

	### Run on the Board

	This package already includes a validated `bin/axllm` binary for AX650.

	From the package root on the board:

	```bash
	chmod +x ./bin/axllm
	./bin/axllm serve . --port 8000
	```

	Expected model id:

	```text
	AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047
	```

	Health check and model listing:

	```bash
	curl http://127.0.0.1:8000/health
	curl http://127.0.0.1:8000/v1/models
	```

	Example health output:

	```json
	{
	  "concurrency": 0,
	  "max_concurrency": 1,
	  "status": "healthy"
	}
	```

	Example model list output:

	```json
	{
	  "data": [
	    {
	      "created": 1780908633,
	      "id": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
	      "object": "model",
	      "owned_by": "openai-api"
	    }
	  ],
	  "object": "list"
	}
	```
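The same two checks can be scripted, for example to wait for the service before sending requests. This sketch uses only the Python standard library and the two endpoints shown above; reading `concurrency < max_concurrency` as "a request slot is free" is an assumption, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same address as the curl examples


def get_json(path, base_url=BASE_URL, timeout=5.0):
    """GET a JSON document from the running axllm service."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def is_idle(health):
    """True if the service reports healthy and (assumed) has a free slot."""
    return (
        health.get("status") == "healthy"
        and health.get("concurrency", 0) < health.get("max_concurrency", 1)
    )


def model_ids(listing):
    """Extract model ids from a /v1/models response."""
    return [entry["id"] for entry in listing.get("data", [])]


# Usage against a running server:
#   if is_idle(get_json("/health")):
#       print(model_ids(get_json("/v1/models")))
```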

	### Text Request

	By default, this package uses no-thinking mode because the packaged `config.json` sets `enable_thinking=false`.

	```bash
	curl http://127.0.0.1:8000/v1/chat/completions \
	  -H 'Content-Type: application/json' \
	  -d '{
	    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
	    "messages": [
	      {
	        "role": "user",
	        "content": "中国的首都是哪里？请只回答城市名。"
	      }
	    ],
	    "max_tokens": 32,
	    "temperature": 0
	  }'
	```

	Example output:

	```json
	{
	  "choices": [
	    {
	      "message": {
	        "role": "assistant",
	        "content": "北京"
	      },
	      "finish_reason": "stop"
	    }
	  ],
	  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
	  "object": "chat.completion"
	}
	```
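For callers that prefer Python over `curl`, the same request can be sent with the standard library. This is a minimal sketch mirroring the fields used above; `build_chat_request`, `first_reply`, and `chat` are illustrative names, and error handling is left out for brevity.

```python
import json
import urllib.request

MODEL_ID = "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047"
BASE_URL = "http://127.0.0.1:8000"


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL,
                       max_tokens=32, temperature=0):
    """Build the POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Return the assistant text from a chat.completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, timeout=60.0, **kwargs):
    """Send one prompt to the running service and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return first_reply(json.load(resp))
```

With the server running, `chat("中国的首都是哪里？请只回答城市名。")` sends the same request as the `curl` example.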

	### Enable Thinking Per Request

	To enable explicit reasoning output for a single request, pass top-level `enable_thinking=true`:

	```bash
	curl http://127.0.0.1:8000/v1/chat/completions \
	  -H 'Content-Type: application/json' \
	  -d '{
	    "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
	    "messages": [
	      {
	        "role": "user",
	        "content": "中国的首都是哪里？请简短思考后给最终答案。"
	      }
	    ],
	    "enable_thinking": true,
	    "max_tokens": 384,
	    "temperature": 0
	  }'
	```

	Typical output shape:

	```json
	{
	  "choices": [
	    {
	      "message": {
	        "role": "assistant",
	        "content": "<think>\n...\n</think>\n\n中国的首都是北京。"
	      },
	      "finish_reason": "stop"
	    }
	  ],
	  "model": "AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047",
	  "object": "chat.completion"
	}
	```

	The Hugging Face-style request form is also accepted:

	```json
	{
	  "chat_template_kwargs": {
	    "enable_thinking": true
	  }
	}
	```
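Both request forms differ only in where the flag sits in the JSON body. A small helper, sketched here with hypothetical names, can add either form to an existing payload without modifying the original dictionary.

```python
def with_thinking(payload, enabled=True, style="top_level"):
    """Return a copy of payload with the thinking flag set.

    style="top_level" adds a top-level enable_thinking field;
    style="chat_template_kwargs" nests it in the Hugging Face-style form.
    """
    result = dict(payload)
    if style == "top_level":
        result["enable_thinking"] = enabled
    elif style == "chat_template_kwargs":
        kwargs = dict(result.get("chat_template_kwargs", {}))
        kwargs["enable_thinking"] = enabled
        result["chat_template_kwargs"] = kwargs
    else:
        raise ValueError(f"unknown style: {style!r}")
    return result
```

Merge the result with the usual `model`, `messages`, `max_tokens`, and `temperature` fields before sending it to `/v1/chat/completions`.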

	When thinking mode is enabled, the service returns client-visible `<think>...</think>` markup so front ends can render reasoning and final answer separately. Follow-up turns also keep the official MiniCPM5 template behavior: previous assistant reasoning content is not reinserted into the next user prompt.
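A front end can separate the two parts with a small parser. The sketch below assumes at most one `<think>...</think>` block per reply, as in the output shape above; `split_thinking` is an illustrative helper, not part of the service.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Split assistant content into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present,
    for example when thinking mode is disabled.
    """
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer
```

If the client keeps its own chat history, storing only the `answer` part for earlier assistant turns keeps the history consistent with the template behavior described above.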

	## Browser UI with `lite_webui`

	If you want a browser UI for the OpenAI-compatible service started by `axllm serve`, use [AXERA-TECH/lite_webui](https://huggingface.co/AXERA-TECH/lite_webui/tree/main).

	Set the OpenAI base URL to `http://<board-ip>:8000` and the model name to `AXERA-TECH/MiniCPM5-1B-AX650-C128-P1152-CTX2047`.

	## Conversion References

	If you need the original model files or want to rebuild the deployment artifacts, start with:

	- Original Hugging Face model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
	- AXERA conversion and deployment workflow: [AXERA-TECH/MiniCPM5-1B.axera](https://github.com/AXERA-TECH/MiniCPM5-1B.axera)

	## Discussion

	- GitHub Issues
	- QQ group: `139953715`