MiniCPM5-1B: Core AI (int8, runs on iPhone)

Apple Core AI (.aimodel) conversion of openbmb/MiniCPM5-1B, OpenBMB's 1.08B on-device LLM with hybrid Think / No-Think reasoning and 128K context, reaching 1B-class open-source SOTA. Runs fully on-device on iPhone and Apple Silicon Macs (GPU, pipelined engine).

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo

On-device numbers (iPhone 17 Pro, A19 Pro)

Measured with the zoo's PipelinedBench (random 128-token prompt, greedy):

variant        decode       prefill      quality                                   size     engine-ready
int8/ (ship)   66.8 tok/s   68.0 tok/s   lossless (24/24 token-exact vs HF fp32)   1.0 GB   2.0 s

int8 is ~2.2× faster than fp16 on iPhone (decode is memory-bandwidth-bound, so halving the weight read roughly doubles throughput) at no quality cost: the device greedy output is token-for-token identical to the fp32 reference on the benchmark prompts. So int8 strictly dominates fp16 here.
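The bandwidth argument above can be checked with back-of-the-envelope arithmetic. The sketch below assumes every decoded token streams all weights from memory exactly once and ignores the KV cache, activations, and quantization scales. The bandwidth figure is a hypothetical placeholder, not a measured A19 Pro number; only the parameter count comes from this card.

```python
PARAMS = 1.08e9  # parameter count quoted in the model description


def weight_bytes(params, bytes_per_weight):
    """Total bytes of weights read per decoded token (one full pass)."""
    return params * bytes_per_weight


def decode_ceiling_tok_s(bandwidth_bytes_per_s, params, bytes_per_weight):
    """Upper bound on decode speed when weight reads are the only cost."""
    return bandwidth_bytes_per_s / weight_bytes(params, bytes_per_weight)


# Hypothetical memory bandwidth in bytes/s, chosen only for illustration.
ASSUMED_BANDWIDTH = 70e9

fp16_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH, PARAMS, 2)  # 2 bytes/weight
int8_ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH, PARAMS, 1)  # 1 byte/weight
speedup = int8_ceiling / fp16_ceiling

print(f"fp16 weights per token: {weight_bytes(PARAMS, 2) / 1e9:.2f} GB")
print(f"int8 weights per token: {weight_bytes(PARAMS, 1) / 1e9:.2f} GB")
print(f"ideal int8 / fp16 decode ratio: {speedup:.2f}")
```

The ratio does not depend on the assumed bandwidth: under this simplified model, halving the bytes per weight doubles the decode ceiling, which is the same order as the ~2.2× measured on device.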

Quantization

Weight-only symmetric per-channel int8 (absmax, no clipping: clipping craters the 130k-vocab LM head, while absmax keeps it lossless), applied as a torch pre-export pass via coreai-opt; SDPA / RoPE / RMSNorm stay full precision. Same recipe family as the zoo's proven sym8.

uv run coreai.llm.export openbmb/MiniCPM5-1B --experimental --compute-precision float16 \
  --compression-config minicpm5_int8sym.yaml
# minicpm5_int8sym.yaml: quantization_config → op_state_spec.weight = {dtype: int8,
#   qscheme: symmetric, granularity: {type: per_channel, axis: 0}}
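To make the recipe concrete, here is a minimal pure-Python sketch of symmetric per-channel absmax int8 quantization with one scale per row (axis 0). It illustrates the general technique only and is not the coreai-opt implementation; the function names and the toy weight matrix are made up for this example.

```python
def quantize_per_channel_absmax(weight):
    """Symmetric int8 quantization with one scale per row (axis 0)."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        # The row's largest magnitude maps to +/-127, so nothing is clipped.
        # An all-zero row gets a dummy scale of 1.0 to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([int(round(v / scale)) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from int8 codes and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Toy 2x4 weight matrix: rows with very different magnitudes.
weight = [
    [0.02, -0.50, 0.25, 0.10],
    [4.00, -1.00, 0.50, -3.20],
]

q, scales = quantize_per_channel_absmax(weight)
restored = dequantize(q, scales)

# Worst-case reconstruction error per row; bounded by half a quantization step.
max_err = [
    max(abs(a - b) for a, b in zip(orig_row, new_row))
    for orig_row, new_row in zip(weight, restored)
]
print(scales)
print(max_err)
```

Because each row has its own scale, the small-magnitude row keeps fine resolution even though the other row contains values eight times larger, and the rounding error in every row stays within half of that row's step size.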

Conversion notes

  • llama → mistral remap. MiniCPM5-1B's model_type is llama; the stock exporter has no llama graph family, but Mistral's builder is architecturally identical for this config (GQA, no qkv bias, no qk-norm, explicit head_dim honored). One-line remap in the model registry.
  • Chat EOS. Base eos_token is </s>, but the chat template ends turns with <|im_end|> (id 130073). The bundle's tokenizer eos_token is set to <|im_end|> (as Qwen ships) so generation halts cleanly.
  • Dynamic-shape bundle → the Core AI pipelined engine (the iPhone path); a static iOS export routes to the static-shape engine instead, which this FM-format bundle doesn't target.
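The Chat EOS note can be illustrated with a toy stop check. The sketch below assumes a decode loop that halts at the first occurrence of the configured EOS id. The id 130073 for <|im_end|> comes from the note above; the base EOS id and the token stream are hypothetical placeholders, since the real id of </s> is not stated here.

```python
IM_END_ID = 130073  # <|im_end|>, as stated in the conversion notes
HYPOTHETICAL_BASE_EOS_ID = 2  # placeholder for </s>; the real id is not given here


def take_until_eos(token_ids, eos_id):
    """Return the tokens generated before the first EOS id, if any."""
    out = []
    for token in token_ids:
        if token == eos_id:
            break
        out.append(token)
    return out


# Made-up stream: a reply, the end-of-turn marker, then tokens past the turn.
stream = [101, 102, 103, IM_END_ID, 104, 105]

clean = take_until_eos(stream, IM_END_ID)
overrun = take_until_eos(stream, HYPOTHETICAL_BASE_EOS_ID)

print(clean)
print(overrun)
```

With the chat marker as EOS the reply stops at the end of the turn. With a stop id that the chat template never emits, the loop keeps consuming tokens past it, which is the failure mode the tokenizer change avoids.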

Run

// iOS / macOS, via Foundation Models
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: modelURL)   // int8/ bundle
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "Explain on-device AI in one sentence."))

License

Apache-2.0 (upstream MiniCPM5 license). Model © OpenBMB; see https://huggingface.co/openbmb/MiniCPM5-1B. Conversion: community.
