---
license: apache-2.0
base_model: openbmb/MiniCPM5-1B
pipeline_tag: text-generation
library_name: core-ai
tags:
- core-ai
- coreml
- apple
- on-device
- iphone
- metal
---

# MiniCPM5-1B: Core AI (int8, runs on iPhone)

Apple **Core AI** (`.aimodel`) conversion of [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B),
OpenBMB's 1.08B on-device LLM with **hybrid Think / No-Think reasoning** and **128K** context, reaching
1B-class open-source SOTA. Runs fully on-device on **iPhone** and Apple Silicon Macs (GPU, pipelined engine).

Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo**

## On-device numbers (iPhone 17 Pro, A19 Pro)

Measured with the zoo's `PipelinedBench` (random 128-token prompt, greedy):

| | decode | prefill | quality | size | engine-ready |
|---|---:|---:|---|---:|---:|
| **`int8/`** (ship) | **66.8 tok/s** | 68.0 tok/s | **lossless** (24/24 token-exact vs HF fp32) | **1.0 GB** | 2.0 s |

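A token-exact check like the one in the quality column can be sketched as follows. This assumes the score counts how many greedy tokens match the reference before the first divergence; the function name and the sample IDs are illustrative and are not taken from `PipelinedBench`.

```python
def token_exact_score(reference_ids, device_ids):
    """Return (matched, total): tokens that agree before the first mismatch.

    Under greedy decoding one wrong token changes everything after it,
    so only the matching prefix is counted.
    """
    total = len(reference_ids)
    matched = 0
    for ref, dev in zip(reference_ids, device_ids):
        if ref != dev:
            break
        matched += 1
    return matched, total


# Made-up token IDs, for illustration only.
reference = [101, 7, 42, 9, 130073]
identical = token_exact_score(reference, [101, 7, 42, 9, 130073])
diverged = token_exact_score(reference, [101, 7, 55, 9, 130073])
print(identical, diverged)
```

A result where `matched == total` is what the table reports as lossless.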

`int8` is **~2.2× faster than fp16** on iPhone (decode is memory-bandwidth-bound, so halving the
bytes of weights read per token roughly doubles throughput) at **no quality cost**: the device greedy
output is token-for-token identical to the fp32 reference on the benchmark prompts. So int8 strictly
dominates fp16 here.

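The bandwidth argument can be put into a back-of-the-envelope model: if each decoded token has to stream every weight once, the throughput ceiling is bandwidth divided by weight bytes. The sketch below uses a placeholder bandwidth (an assumption, not a measured A19 Pro figure) and assumes the fp16 weights take twice the int8 bundle size.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode speed when every token streams all weights once."""
    return bandwidth_gb_s / weight_gb


# Placeholder value, NOT a measured device figure; it cancels out of the ratio.
ASSUMED_BANDWIDTH_GB_S = 100.0
INT8_WEIGHT_GB = 1.0                  # bundle size from the table above
FP16_WEIGHT_GB = 2 * INT8_WEIGHT_GB   # assumption: two bytes per weight

int8_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, INT8_WEIGHT_GB)
fp16_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, FP16_WEIGHT_GB)
speedup = int8_ceiling / fp16_ceiling
print(f"predicted int8 / fp16 decode speedup: {speedup:.1f}x")
```

The predicted ratio does not depend on the assumed bandwidth, only on the weight sizes, which is why the estimate is about 2× on any bandwidth-bound device. The ~2.2× above is the measured figure, not an output of this model.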

## Quantization

Weight-only **symmetric per-channel int8** (absmax, no clipping: clipping craters the 130k-vocab LM
head; absmax keeps it lossless), applied as a torch pre-export pass via `coreai-opt`; SDPA / RoPE /
RMSNorm stay full precision. Same recipe family as the zoo's proven `sym8`.

```bash
uv run coreai.llm.export openbmb/MiniCPM5-1B --experimental --compute-precision float16 \
  --compression-config minicpm5_int8sym.yaml
# minicpm5_int8sym.yaml: quantization_config → op_state_spec.weight = {dtype: int8,
#   qscheme: symmetric, granularity: {type: per_channel, axis: 0}}
```

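For readers unfamiliar with the scheme, here is a minimal sketch of what symmetric per-channel absmax int8 does numerically. It is plain Python on a toy matrix, not the `coreai-opt` torch pass; the function names and weight values are illustrative, and rows are treated as the channel axis to mirror `axis: 0` in the config comment.

```python
def quantize_per_channel_absmax(weight):
    """Quantize a 2-D weight (list of rows) to int8 with one scale per row."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        # Symmetric: zero maps to zero, the largest magnitude maps to +/-127.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map int8 codes back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Illustrative values, not real model weights.
weight = [
    [0.02, -0.50, 0.13],
    [4.00, 0.01, -1.25],   # a row with a large outlier gets its own scale
]
q, scales = quantize_per_channel_absmax(weight)
restored = dequantize(q, scales)
# Worst-case absolute round-trip error for each row.
row_errors = [
    max(abs(a - b) for a, b in zip(orig_row, back_row))
    for orig_row, back_row in zip(weight, restored)
]
print(q, scales, row_errors)
```

Because each row's scale comes from its own absolute maximum, the outlier in the second row is represented exactly at the edge of the int8 range and every other value is off by at most half a quantization step. A clipping scheme would pick a smaller range and saturate that outlier, which is the failure mode the paragraph above attributes to the LM head.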

## Conversion notes

- **`llama → mistral` remap.** MiniCPM5-1B's `model_type` is `llama`; the stock exporter has no
  `llama` graph family, but Mistral's builder is architecturally identical for this config (GQA,
  no qkv bias, no qk-norm, explicit `head_dim` honored). One-line remap in the model registry.
- **Chat EOS.** Base `eos_token` is `</s>`, but the chat template ends turns with `<|im_end|>`
  (id 130073). The bundle's tokenizer `eos_token` is set to `<|im_end|>` (as Qwen ships) so
  generation halts cleanly.
- **Dynamic-shape bundle.** The bundle is exported with dynamic shapes, which routes it to the
  Core AI pipelined engine (the iPhone path); a static iOS export routes to the static-shape
  engine instead, which this FM-format bundle doesn't target.

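The first two notes amount to two small overrides, sketched below. The registry dict and both function names are hypothetical stand-ins, not the exporter's actual API, and the tokenizer config is assumed to store `eos_token` as a plain string.

```python
import json

# Hypothetical registry mapping a Hugging Face `model_type` to a graph family.
# The real exporter's registry and names may differ.
GRAPH_FAMILY = {
    "mistral": "mistral",
    "llama": "mistral",   # the one-line remap described above
}


def graph_family(model_type: str) -> str:
    """Look up which graph builder handles a given model_type."""
    return GRAPH_FAMILY[model_type]


def set_chat_eos(tokenizer_config: dict, eos: str = "<|im_end|>") -> dict:
    """Return a copy of the config whose eos_token is the chat end-of-turn token."""
    patched = dict(tokenizer_config)
    patched["eos_token"] = eos
    return patched


base = {"eos_token": "</s>"}   # base EOS, per the note above
patched = set_chat_eos(base)
print(graph_family("llama"), json.dumps(patched))
```

Returning a copy leaves the upstream config untouched, so the same base tokenizer files can still be used for non-chat generation.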

## Run

```swift
// iOS / macOS, via Foundation Models
import FoundationModels
import CoreAILanguageModels

// modelURL: file URL of the downloaded int8/ bundle
let model = try await CoreAILanguageModel(resourcesAt: modelURL)
let session = LanguageModelSession(model: model)
let response = try await session.respond(to: "Explain on-device AI in one sentence.")
print(response.content)  // print the generated text rather than the response wrapper
```

## License

Apache-2.0 (upstream MiniCPM5 license). Model © OpenBMB; see
https://huggingface.co/openbmb/MiniCPM5-1B. Conversion: community.