---
license: other
license_name: lfm1.0
license_link: LICENSE
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- lfm2
- hybrid
pipeline_tag: text-generation
---
# LFM2.5-1.2B-Instruct: Apple Core AI (`.aimodel`)
**LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI** (the Core ML successor
announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid
(10 short-conv mixers + 6 GQA attention layers) riding Apple's **`coreai-pipelined` GPU
engine**: the first non-Qwen architecture on that fast path, with zero custom kernels.
> Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge
> base, and the Swift runner: **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**.
<!-- gen-cards:use-it begin id=lfm2.5-1.2b (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) -->
## Use it
▶️ **Run it (source)** with the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
(GUI + CLI, one app for every chat model in the catalog):
```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# → Run, then pick "LFM2.5 1.2B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model lfm2.5-1.2b --prompt "What can you do, offline?"
```
💻 **Build with it** (complete; the glue is kit API, so a copy-paste runs):
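```swift
import CoreAIKit

// Any user text; this is the same prompt the CLI example above uses.
let prompt = "What can you do, offline?"
let chat = try await ChatSession(catalog: "lfm2.5-1.2b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
```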
The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `ChatSession` across turns for its transcript.

Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn; it keeps the
conversation history. `streamResponse(to:)` yields tokens as they decode.

**Integration checklist**
- SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (1.7 GB on Mac, 1.7 GB on iPhone), then it loads from the
local cache (Application Support; progress via the `downloadProgress` callback)
- Measure in Release: Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->
## Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle)
| Surface | Bundle | Prefill (tok/s) | Decode (tok/s) |
|---|---|---:|---:|
| **M4 Max**, release `llm-benchmark` | ★★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/` (1.6 GB) | 277.8 | **276.5** |
| **iPhone 17 Pro**, one-shot runner | ★★★ same bundle | 44.2–46.6 | **44.1–46.6** |
| M4 Max, release `llm-benchmark` | ★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/` (1.5 GB) | 253.3 | **253.3** |
| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 | **38.0–39.6** |
| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 | **35.8** |

- **★★★ = the ship config** (`int8hu_block32_sym`): int8lin plus the tied lm_head untied and
quantized **absmax per-block-32 int8** (`symmetric`, no clipping, because clipping corrupts
big-vocab heads). +9% on M4 Max, **+15–20% on iPhone** (44.1–46.6 tok/s ≈ 94–98% of the naive
bandwidth ceiling, ~60 GB/s ÷ ~1.27 GB/token); warm engine load 0.3 s. Greedy rollouts are
token-identical to the int8lin bundle on both verification prompts; oracle gate 16/16 +
decode step, device numerics 24/24 ≡ Mac-GPU on all 3 runs.
- ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its
ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed.
- iPhone greedy sequences are **24/24 token-identical to the M4 Max GPU** on both fixed
verification prompts (both bundles).
- For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max; this 1.2B does
276.5.
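
The "naive bandwidth ceiling" quoted above is a one-line division, and the sketch below reproduces it in Python (the language of the conversion tooling). It assumes the ~1.27 GB/token figure means the bytes streamed from memory for each decoded token; the function names are illustrative and not part of any kit API. Because both inputs are rounded in this card, the result lands within about a point of the quoted 94–98% rather than matching it exactly.

```python
# Figures quoted in this card; both are approximate ("~") in the source.
BANDWIDTH_GB_PER_S = 60.0   # iPhone memory bandwidth figure used for the ceiling
GB_READ_PER_TOKEN = 1.27    # assumed: data streamed per decoded token


def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Naive decode ceiling: tokens/s if memory traffic were the only cost."""
    return bandwidth_gb_s / gb_per_token


def fraction_of_ceiling(measured_tok_s: float, ceiling_tok_s: float) -> float:
    """How close a measured decode rate comes to the naive ceiling."""
    return measured_tok_s / ceiling_tok_s


ceiling = bandwidth_ceiling_tok_s(BANDWIDTH_GB_PER_S, GB_READ_PER_TOKEN)
print(f"ceiling: {ceiling:.1f} tok/s")
for measured in (44.1, 46.6):  # iPhone 17 Pro decode range, ship bundle
    share = fraction_of_ceiling(measured, ceiling)
    print(f"{measured} tok/s is {share:.1%} of the ceiling")

# Decode speedup of the ship bundle over int8lin on M4 Max (table values).
print(f"M4 Max speedup: {276.5 / 253.3 - 1:+.1%}")
```

The same division works for any bundle once its per-token traffic is known, which is why the ceiling is called naive: it ignores compute, launch overhead, and KV-cache reads.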
## What the bundle is
One full **LanguageBundle** (`.aimodel` + `tokenizer/` + `metadata.json`): decode-only graph,
`input_ids` static `[1,1]`, position_ids + KV seq dynamic (→ the engine factory selects
`coreai-pipelined`: async non-blocking encode, on-GPU argmax sampling, on-device KV growth).

Weights are **int8 linear per-block-32** (scale-multiply dequant, no LUT; k-means LUT
gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms,
and the four attention projections kept high-precision. In the ★★★ bundle the lm_head is
untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied).
Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on
the current beta GPU delegate (garbage logits caused by a delegate lowering bug, documented
in the zoo knowledge base).

The attention projections are fp32 on purpose: under a dynamic-shape graph the delegate's
fp16 attention-prologue matmuls lose ~1.3% relative accuracy, which LFM2.5's large q/k-norm
gains amplify into wrong logits. Keeping them fp32 restores layer-level exactness (+126 MB).
Full write-up:
[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
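
To make "absmax per-block-32 int8" concrete, here is a minimal pure-Python sketch of the general technique: one scale per block of 32 weights, set so the block's largest magnitude maps to ±127, with dequantization as a plain multiply. It is not the converter's code. The function names, the row-wise layout, and the toy data are assumptions made for illustration; the real export lives in the zoo's conversion script.

```python
BLOCK = 32  # block size along the row, as in "per-block-32"


def quantize_row_absmax(row, block=BLOCK):
    """Symmetric absmax int8: one scale per block, no outlier clipping."""
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(x) for x in chunk)
        # The largest magnitude maps to +/-127, so no value saturates.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        q.extend(round(x / scale) for x in chunk)
    return q, scales


def dequantize_row(q, scales, block=BLOCK):
    """Scale-multiply dequant: int8 value times its block's scale (no LUT)."""
    return [v * scales[i // block] for i, v in enumerate(q)]


# Toy row of 64 weights with a single outlier in the second block.
row = [0.01 * ((i % 7) - 3) for i in range(64)]
row[40] = 0.9
q, scales = quantize_row_absmax(row)
restored = dequantize_row(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"blocks={len(scales)} q-range=[{min(q)}, {max(q)}] max_err={max_err:.5f}")
```

In the toy row the outlier coarsens only the scale of its own block; the first block keeps its fine scale. That locality is the usual argument for small blocks over one scale per tensor.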
## Run it
```bash
git clone https://github.com/john-rocky/coreai-model-zoo
git clone https://github.com/apple/coreai-models
git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \
../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch
# (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state)
# download this bundle into coreai-models/exports/, then:
cd coreai-models && swift build -c release
COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \
--model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3
```
Run contract (each of these matters):
- `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill must run as pipelined S=1
steps (prompt tok/s ≈ decode tok/s).
- **Never call `engine.warmup()`** on this S=1 bundle (it warms query length 256, which the
static `[1,1]` graph rejects). A 1-token generate after load is the warmup;
`llm-runner` needs `--warmup exact --warmup-length 1`.
- Benchmark **Release** builds only (a Debug engine measures ~3× slower).

On iPhone, the [CoreAIChat sample app](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat)
has an **LFM** picker mode that downloads this repo in-app and chats through this bundle.
## Conversion
Reproducible with
[`conversion/export_lfm2_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_lfm2_decode_pipelined.py)
(+ the `models/macos/lfm2.py` overlay) from the upstream HF checkpoint. Numerics are gated
the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the
fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode
step, not long-rollout eyeballing. Model card with the full method and the GPU-delegate
findings: [`zoo/lfm2.5.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/lfm2.5.md).
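
The top-1 gate described above reduces to comparing argmax tokens position by position and requiring every one to agree. The following is a rough sketch of that comparison only, assuming per-position logits from both the fp32 reference and the converted model have already been collected with teacher forcing (each step fed the known prompt token, not the model's own output). `top1_gate` and the toy logits are hypothetical stand-ins, not the zoo's actual harness.

```python
def argmax(values):
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_gate(reference_logits, candidate_logits):
    """Return (matches, total, passed); passes only if every position agrees."""
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("both runs must cover the same positions")
    total = len(reference_logits)
    matches = sum(
        argmax(ref) == argmax(cand)
        for ref, cand in zip(reference_logits, candidate_logits)
    )
    return matches, total, matches == total


# Toy stand-ins: 16 positions, vocabulary of 4 tokens.
reference = [[0.1 * ((p + v) % 4) for v in range(4)] for p in range(16)]
# Small numeric drift that leaves every top-1 token unchanged.
candidate = [[x + 0.001 * v for v, x in enumerate(row)] for row in reference]
# One position whose top-1 token flips.
broken = [list(row) for row in candidate]
broken[5][0] += 1.0

print(top1_gate(reference, candidate))
print(top1_gate(reference, broken))
```

Gating on exact top-1 agreement at fixed positions gives a pass/fail signal that a long free-running rollout does not, since a rollout can look plausible while having diverged from the reference early on.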
## License
The model weights derive from
[LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) and are
redistributed under the **LFM Open License v1.0** (see [LICENSE](LICENSE)): Apache-style
grants, but **Commercial Use is licensed only for entities below US$10M annual revenue**
(qualified non-profits exempt for non-commercial/research use). The conversion code is
BSD-3-Clause (zoo repo).