MiniCPM5-1B int8 (Core AI) — iPhone 66.8 tok/s, lossless

6ec16ba verified about 19 hours ago

3.06 kB

	---
	license: apache-2.0
	base_model: openbmb/MiniCPM5-1B
	pipeline_tag: text-generation
	library_name: core-ai
	tags:
	- core-ai
	- coreml
	- apple
	- on-device
	- iphone
	- metal
	---

	# MiniCPM5-1B — Core AI (int8, runs on iPhone)

	Apple Core AI (`.aimodel`) conversion of [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) —
	OpenBMB's 1.08B on-device LLM with hybrid Think / No-Think reasoning and 128K context, reaching
	1B-class open-source SOTA. Runs fully on-device on iPhone and Apple Silicon Macs (GPU, pipelined engine).

	Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo

	## On-device numbers (iPhone 17 Pro, A19 Pro)

	Measured with the zoo's `PipelinedBench` (random 128-token prompt, greedy):

	\| \| decode \| prefill \| quality \| size \| engine-ready \|
	\|---\|---:\|---:\|---\|---:\|---:\|
	\| `int8/` (ship) \| 66.8 tok/s \| 68.0 tok/s \| lossless (24/24 token-exact vs HF fp32) \| 1.0 GB \| 2.0 s \|

	`int8` is ~2.2× faster than fp16 on iPhone (decode is memory-bandwidth-bound, so halving the
	weight read ≈ doubles throughput) at no quality cost — the device greedy output is token-for-token
	identical to the fp32 reference on the benchmark prompts. So int8 strictly dominates fp16 here.

	## Quantization

	Weight-only symmetric per-channel int8 (absmax, no clipping — clipping craters the 130k-vocab LM
	head; absmax keeps it lossless), applied as a torch pre-export pass via `coreai-opt`; SDPA / RoPE /
	RMSNorm stay full precision. Same recipe family as the zoo's proven `sym8`.

	```bash
	uv run coreai.llm.export openbmb/MiniCPM5-1B --experimental --compute-precision float16 \
	--compression-config minicpm5_int8sym.yaml
	# minicpm5_int8sym.yaml: quantization_config → op_state_spec.weight = {dtype: int8,
	# qscheme: symmetric, granularity: {type: per_channel, axis: 0}}
	```

	## Conversion notes

	- `llama → mistral` remap. MiniCPM5-1B's `model_type` is `llama`; the stock exporter has no
	`llama` graph family, but Mistral's builder is architecturally identical for this config (GQA,
	no qkv bias, no qk-norm, explicit `head_dim` honored). One-line remap in the model registry.
	- Chat EOS. Base `eos_token` is `</s>`, but the chat template ends turns with `<\|im_end\|>`
	(id 130073). The bundle's tokenizer `eos_token` is set to `<\|im_end\|>` (as Qwen ships) so
	generation halts cleanly.
	- Dynamic-shape bundle → the Core AI pipelined engine (the iPhone path); a static iOS export
	routes to the static-shape engine instead, which this FM-format bundle doesn't target.

	## Run

	```swift
	// iOS / macOS, via Foundation Models
	import FoundationModels
	import CoreAILanguageModels
	let model = try await CoreAILanguageModel(resourcesAt: modelURL) // int8/ bundle
	let session = LanguageModelSession(model: model)
	print(try await session.respond(to: "Explain on-device AI in one sentence."))
	```

	## License

	Apache-2.0 (upstream MiniCPM5 license). Model © OpenBMB — see
	https://huggingface.co/openbmb/MiniCPM5-1B. Conversion: community.