| --- |
| license: other |
| license_name: lfm1.0 |
| license_link: LICENSE |
| base_model: LiquidAI/LFM2.5-1.2B-Instruct |
| tags: |
| - coreai |
| - aimodel |
| - apple-silicon |
| - on-device |
| - lfm2 |
| - hybrid |
| pipeline_tag: text-generation |
| --- |
| |
| # LFM2.5-1.2B-Instruct: Apple Core AI (`.aimodel`) |
|
|
| **LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI** (the Core ML successor |
| announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid |
| (10 short-conv mixers + 6 GQA attention layers) riding Apple's **`coreai-pipelined` GPU |
| engine**, the first non-Qwen architecture on that fast path, with zero custom kernels. |
|
|
| > Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge |
| > base, and the Swift runner: **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**. |
|
|
| <!-- gen-cards:use-it begin id=lfm2.5-1.2b (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) --> |
| ## Use it |
|
|
| ▶️ **Run it (source)**: the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo) |
| (GUI + CLI, one app for every chat model in the catalog): |
|
|
| ```bash |
| git clone https://github.com/john-rocky/coreai-kit |
| open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj |
| # -> Run, then pick "LFM2.5 1.2B" in the model picker |
| |
| # agents / headless (macOS): |
| cd coreai-kit/Examples/ChatDemo |
| swift run chat-cli --model lfm2.5-1.2b --prompt "What can you do, offline?" |
| ``` |
|
|
| 💻 **Build with it**: complete; the glue is kit API, copy-paste runs: |
|
|
| ```swift |
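| import CoreAIKit |
| |
| // Any user prompt; this one mirrors the CLI example above. |
| let prompt = "What can you do, offline?" |
| let chat = try await ChatSession(catalog: "lfm2.5-1.2b") |
| let reply = try await chat.respond(to: prompt) |
| // reply: the answer, generated fully on-device |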
| ``` |
|
|
| The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift): |
| this exact code as one typed function, no UI. The CLI is an argument shell over it, and |
| the GUI drives the same `ChatSession` across turns for its transcript. |
| Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn; it keeps the |
| conversation history, and `streamResponse(to:)` yields tokens as they decode. |
|
|
| **Integration checklist** |
|
|
| - SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit** |
| - Info.plist: none needed |
| - Entitlements: none needed |
| - First run downloads the model (1.7 GB on Mac, 1.7 GB on iPhone); after that it loads from the |
| local cache (Application Support; progress via the `downloadProgress` callback) |
| - Measure in Release: Debug is ~3× slower on per-token host work |
| <!-- gen-cards:use-it end --> |
|
|
| ## Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle) |
|
|
| | Surface | Bundle | Prefill | Decode | |
| |---|---|---:|---:| |
| **M4 Max**, release `llm-benchmark` | ★★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/` (1.6 GB) | 277.8 tok/s | **276.5 tok/s** |
| **iPhone 17 Pro**, one-shot runner | ★★★ same bundle | 44.2–46.6 tok/s | **44.1–46.6 tok/s** |
| M4 Max, release `llm-benchmark` | ★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/` (1.5 GB) | 253.3 tok/s | **253.3 tok/s** |
| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 tok/s | **38.0–39.6 tok/s** |
| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 tok/s | **35.8 tok/s** |
|
|
| - **★★★ = the ship config** (`int8hu_block32_sym`): int8lin + the tied lm_head untied and |
| quantized **absmax per-block-32 int8** (`symmetric`, no clipping; clipping corrupts |
| big-vocab heads). +9% on M4 Max, **+15–20% on iPhone** (44.1–46.6 tok/s ≈ 94–98% of the naive |
| bandwidth ceiling, ~60 GB/s ÷ ~1.27 GB/token); warm engine load 0.3 s. Greedy rollouts are |
| token-identical to the int8lin bundle on both verification prompts; oracle gate 16/16 + |
| decode step, device numerics 24/24 ≡ Mac-GPU on all 3 runs. |
| - ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its |
| ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed. |
| - iPhone greedy sequences are **24/24 token-identical to the M4 Max GPU** on both fixed |
| verification prompts (both bundles). |
| - For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max; this 1.2B does |
| 276.5. |
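
The bandwidth-ceiling figures in the list above can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic: the ~60 GB/s and ~1.27 GB/token inputs are the rounded values quoted in the list, and the function name is made up for illustration.

```python
def naive_decode_ceiling(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode tok/s if each token streams all weights once."""
    return bandwidth_gb_s / gb_per_token


ceiling = naive_decode_ceiling(60.0, 1.27)  # rounded inputs quoted in the list
low, high = 44.1, 46.6                      # measured iPhone decode range (tok/s)
mac_gain = 276.5 / 253.3 - 1                # ship config vs int8lin on M4 Max

print(f"ceiling ~ {ceiling:.1f} tok/s")
print(f"measured range = {low / ceiling:.0%} to {high / ceiling:.0%} of ceiling")
print(f"M4 Max gain over int8lin = {mac_gain:.0%}")
```

With these rounded inputs the ceiling comes out near 47 tok/s and the measured range at roughly 93% to 99% of it, which agrees with the quoted ~94–98% to within the rounding of the inputs.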
| |
| ## What the bundle is |
| |
| One full **LanguageBundle** (`.aimodel` + `tokenizer/` + `metadata.json`): decode-only graph, |
| `input_ids` static `[1,1]`, position_ids + KV seq dynamic (→ the engine factory selects |
| `coreai-pipelined`: async non-blocking encode, on-GPU argmax sampling, on-device KV growth). |
| Weights are **int8 linear per-block-32** (scale-multiply dequant, no LUT; k-means LUT |
| gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms, |
| and the four attention projections kept high-precision; in the ★★★ bundle the lm_head is |
| untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied). |
| Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on |
| the current beta GPU delegate (garbage logits; a delegate lowering bug, documented in the |
| zoo knowledge base). The attention projections are fp32 on purpose: under a dynamic-shape |
| graph the delegate's fp16 attention-prologue matmuls lose ~1.3% relative accuracy, which |
| LFM2.5's large q/k-norm gains amplify into wrong logits; fp32 there restores layer-level |
| exactness (+126 MB). Full write-up: |
| [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md). |
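
To make "absmax per-block-32 int8, symmetric, no clipping" concrete, here is a minimal pure-Python sketch of the idea. It is not the zoo's converter (that lives in the export script and works on tensors); the function names and the toy row are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

BLOCK = 32  # block size named in the bundle description


def quantize_absmax_blocks(row: List[float], block: int = BLOCK) -> Tuple[List[int], List[float]]:
    """Symmetric absmax int8: one scale per block, outliers are not clipped."""
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(v) for v in chunk)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        q.extend(int(round(v / scale)) for v in chunk)
    return q, scales


def dequantize(q: List[int], scales: List[float], block: int = BLOCK) -> List[float]:
    """Scale-multiply dequant: w ~= q * scale, with no lookup table."""
    return [qi * scales[i // block] for i, qi in enumerate(q)]


# Toy row of 64 weights with one outlier in the second block (made-up values).
row = [0.01 * (i % 7 - 3) for i in range(64)]
row[40] = 2.5
q, scales = quantize_absmax_blocks(row)
restored = dequantize(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Because each block carries its own scale, the outlier maps exactly to the int8 extreme and only coarsens the 32 weights that share its block; the first block keeps its much finer scale. The rounding error per weight is bounded by half the scale of its block.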
|
|
| ## Run it |
|
|
| ```bash |
| git clone https://github.com/john-rocky/coreai-model-zoo |
| git clone https://github.com/apple/coreai-models |
| git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \ |
| ../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch |
| # (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state) |
| |
| # download this bundle into coreai-models/exports/, then: |
| cd coreai-models && swift build -c release |
| COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \ |
| --model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3 |
| ``` |
|
|
| Run contract (each of these matters): |
| - `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill must run as pipelined S=1 |
| steps (prompt tok/s ≈ decode tok/s). |
| - **Never call `engine.warmup()`** on this S=1 bundle (it warms query length 256, which the |
| static `[1,1]` graph rejects). A 1-token generate after load is the warmup; |
| `llm-runner` needs `--warmup exact --warmup-length 1`. |
| - Benchmark **Release** builds only (a Debug engine measures ~3× slower). |
|
|
| On iPhone, the [CoreAIChat sample app](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat) |
| has an **LFM** picker mode that downloads this repo in-app and chats through this bundle. |
|
|
| ## Conversion |
|
|
| Reproducible with |
| [`conversion/export_lfm2_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_lfm2_decode_pipelined.py) |
| (+ the `models/macos/lfm2.py` overlay) from the upstream HF checkpoint. Numerics are gated |
| the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the |
| fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode |
| step, not long-rollout eyeballing. Model card with the full method and the GPU-delegate |
| findings: [`zoo/lfm2.5.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/lfm2.5.md). |
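
The comparison step of that gate can be sketched as follows. This is an illustration under stated assumptions, not the zoo's gating code: the logits are made-up toy values over a 4-token vocabulary, and `top1_gate` is a hypothetical helper. In the real gate, each position's logits come from running the converted model and the fp32 reference on the same oracle prefix (teacher forcing); only the top-1 comparison is shown here.

```python
from typing import List, Sequence


def argmax(logits: Sequence[float]) -> int:
    """Index of the largest logit (the greedy top-1 token)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top1_gate(candidate: List[Sequence[float]], reference: List[Sequence[float]]) -> int:
    """Count positions where candidate and reference agree on the top-1 token."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must cover the same positions")
    return sum(argmax(c) == argmax(r) for c, r in zip(candidate, reference))


# Toy per-position logits: 16 positions, 4-token vocabulary (hypothetical data).
reference = [[0.1 * ((p + t) % 4) for t in range(4)] for p in range(16)]
candidate = [[v + 0.01 for v in row] for row in reference]  # small uniform drift

matches = top1_gate(candidate, reference)
passed = matches == len(reference)  # the gate requires every position to match
print(f"top-1 gate: {matches}/{len(reference)}, passed={passed}")
```

Requiring a match at every position is deliberately strict: a single flipped argmax fails the gate, which is what distinguishes it from judging a long free-running rollout by eye.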
|
|
| ## License |
|
|
| The model weights derive from |
| [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) and are |
| redistributed under the **LFM Open License v1.0** (see [LICENSE](LICENSE)): Apache-style |
| grants, but **Commercial Use is licensed only for entities below US$10M annual revenue** |
| (qualified non-profits exempt for non-commercial/research use). The conversion code is |
| BSD-3-Clause (zoo repo). |
|
|