---
license: apache-2.0
base_model: openbmb/MiniCPM5-1B
pipeline_tag: text-generation
library_name: core-ai
tags:
- core-ai
- coreml
- apple
- on-device
- iphone
- metal
---

# MiniCPM5-1B — Core AI (int8, runs on iPhone)

Apple **Core AI** (`.aimodel`) conversion of [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) —
OpenBMB's 1.08B on-device LLM with **hybrid Think / No-Think reasoning** and **128K** context, reaching
1B-class open-source SOTA. Runs fully on-device on **iPhone** and Apple Silicon Macs (GPU, pipelined engine).

Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo**

## On-device numbers (iPhone 17 Pro, A19 Pro)

Measured with the zoo's `PipelinedBench` (random 128-token prompt, greedy):

| variant | decode | prefill | quality | size | engine-ready |
|---|---:|---:|---|---:|---:|
| **`int8/`** (ship) | **66.8 tok/s** | 68.0 tok/s | **lossless** (24/24 token-exact vs HF fp32) | **1.0 GB** | 2.0 s |

`int8` decodes **~2.2× faster than fp16** on iPhone: decode is memory-bandwidth-bound, so halving the
bytes read per weight roughly doubles throughput. This comes at **no measured quality cost**: the device
greedy output is token-for-token identical to the fp32 reference on the benchmark prompts, so int8
strictly dominates fp16 here.
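
The bandwidth argument can be sketched as back-of-envelope arithmetic. This is a simplified model, not a measurement: it assumes every decoded token requires one full read of the weights and ignores KV-cache traffic and compute. `ASSUMED_BANDWIDTH` is a placeholder value, not an A19 Pro figure.

```python
PARAMS = 1.08e9  # parameter count quoted on this card


def decode_upper_bound(bytes_per_weight, bandwidth_bytes_per_s):
    """Tokens/s ceiling if each token needs one full pass over the weights."""
    return bandwidth_bytes_per_s / (PARAMS * bytes_per_weight)


# Placeholder bandwidth (assumption, NOT a measured device number).
ASSUMED_BANDWIDTH = 60e9

fp16_bound = decode_upper_bound(2, ASSUMED_BANDWIDTH)  # 2 bytes per weight
int8_bound = decode_upper_bound(1, ASSUMED_BANDWIDTH)  # 1 byte per weight

# The ratio does not depend on the bandwidth value chosen above.
speedup = int8_bound / fp16_bound
print(f"modelled int8/fp16 speedup: {speedup:.1f}x")
```

Under this model the speedup is exactly 2× whatever bandwidth is plugged in, which is in the same range as the ~2.2× measured on device.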

## Quantization

Weight-only **symmetric per-channel int8**, applied as a torch pre-export pass via `coreai-opt`.
Scales use absmax with no clipping: clipping craters the 130k-vocab LM head, while absmax keeps it
lossless. SDPA / RoPE / RMSNorm stay full precision. Same recipe family as the zoo's proven `sym8`.

```bash
uv run coreai.llm.export openbmb/MiniCPM5-1B --experimental --compute-precision float16 \
  --compression-config minicpm5_int8sym.yaml
# minicpm5_int8sym.yaml: quantization_config → op_state_spec.weight = {dtype: int8,
#   qscheme: symmetric, granularity: {type: per_channel, axis: 0}}
```
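
For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of symmetric per-channel absmax quantization. It is an illustration only, not the `coreai-opt` implementation: the function names are hypothetical, and it assumes each row of the weight matrix is one output channel (matching `axis: 0`).

```python
def quantize_per_channel_int8(weight):
    """Symmetric absmax int8 quantization with one scale per row (channel)."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        # Map [-absmax, absmax] onto [-127, 127]; nothing is clipped,
        # so the largest-magnitude weight in each row is kept exactly in range.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from int8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Toy weights: the two rows have very different magnitudes.
weight = [[0.02, -0.5, 0.25], [4.0, -1.0, 0.1]]
q, scales = quantize_per_channel_int8(weight)
restored = dequantize(q, scales)
max_err = max(
    abs(a - b) for ra, rb in zip(weight, restored) for a, b in zip(ra, rb)
)
print(q, scales, max_err)
```

Because each row gets its own scale, the small-magnitude row is not forced onto the coarse grid of the large one, and the rounding error per weight stays within half of that row's scale.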

## Conversion notes

- **`llama → mistral` remap.** MiniCPM5-1B's `model_type` is `llama`; the stock exporter has no
  `llama` graph family, but Mistral's builder is architecturally identical for this config (GQA,
  no qkv bias, no qk-norm, explicit `head_dim` honored). One-line remap in the model registry.
- **Chat EOS.** Base `eos_token` is `</s>`, but the chat template ends turns with `<|im_end|>`
  (id 130073). The bundle's tokenizer `eos_token` is set to `<|im_end|>` (as Qwen ships) so
  generation halts cleanly.
- **Dynamic-shape bundle.** The export uses dynamic shapes, which routes it to the Core AI pipelined
  engine (the iPhone path). A static iOS export would route to the static-shape engine instead, which
  this FM-format bundle doesn't target.
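
The first two notes amount to small config patches. The sketch below shows their shape using plain dictionaries; `GRAPH_FAMILY_REMAP`, `resolve_graph_family`, and `patch_chat_eos` are hypothetical names for illustration, and the real exporter registry and tokenizer files may be organized differently.

```python
import json

# Hypothetical remap table: model_type values with no graph family of their
# own are pointed at an architecturally identical builder.
GRAPH_FAMILY_REMAP = {"llama": "mistral"}


def resolve_graph_family(hf_config):
    """Return the graph family to build for a Hugging Face style config dict."""
    model_type = hf_config["model_type"]
    return GRAPH_FAMILY_REMAP.get(model_type, model_type)


def patch_chat_eos(tokenizer_config, chat_eos="<|im_end|>"):
    """Return a copy of the tokenizer config whose EOS is the chat end-of-turn token."""
    patched = dict(tokenizer_config)
    patched["eos_token"] = chat_eos
    return patched


family = resolve_graph_family({"model_type": "llama"})
patched = patch_chat_eos({"eos_token": "</s>"})
print(family, json.dumps(patched))
```

Without the EOS patch, a runtime that stops only on the base `</s>` would keep generating past the end of the assistant turn.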

## Run

```swift
// iOS / macOS, via Foundation Models
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: modelURL)   // int8/ bundle
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "Explain on-device AI in one sentence."))
```

## License

Apache-2.0 (upstream MiniCPM5 license). Model © OpenBMB — see
https://huggingface.co/openbmb/MiniCPM5-1B. Conversion: community.