license: other
license_name: lfm1.0
license_link: LICENSE
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- lfm2
- hybrid
pipeline_tag: text-generation
LFM2.5-1.2B-Instruct - Apple Core AI (.aimodel)
LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI (the Core ML successor
announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid
(10 short-conv mixers + 6 GQA attention layers) riding Apple's coreai-pipelined GPU
engine: the first non-Qwen architecture on that fast path, with zero custom kernels.
Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge base, and the Swift runner: coreai-model-zoo.
Use it
▶️ Run it (source) with the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# Run, then pick "LFM2.5 1.2B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model lfm2.5-1.2b --prompt "What can you do, offline?"
💻 Build with it (complete; the glue is kit API, so a copy-paste runs):
import CoreAIKit
let chat = try await ChatSession(catalog: "lfm2.5-1.2b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
The take-home is Examples/ChatDemo/Sources/QuickStart.swift:
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same ChatSession across turns for its transcript.
Multi-turn? Hold the ChatSession and call respond(to:) per turn; it keeps the
conversation history. streamResponse(to:) yields tokens as they decode.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit (product: CoreAIKit)
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (1.7 GB on Mac, 1.7 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle)
| Surface | Bundle | Prefill | Decode |
|---|---|---|---|
| M4 Max, release llm-benchmark | ★★★ gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/ (1.6 GB) | 277.8 tok/s | 276.5 tok/s |
| iPhone 17 Pro, one-shot runner | ★★★ same bundle | 44.2–46.6 tok/s | 44.1–46.6 tok/s |
| M4 Max, release llm-benchmark | ★★ gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/ (1.5 GB) | 253.3 tok/s | 253.3 tok/s |
| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 tok/s | 38.0–39.6 tok/s |
| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 tok/s | 35.8 tok/s |
- ★★★ = the ship config (int8hu_block32_sym): int8lin plus the tied lm_head untied and quantized absmax per-block-32 int8 (symmetric, no clipping; clipping corrupts big-vocab heads). +9% on M4 Max, +15–20% on iPhone (44.1–46.6 tok/s, about 94–98% of the naive bandwidth ceiling, ~60 GB/s ÷ ~1.27 GB/token); warm engine load 0.3 s. Greedy rollouts are token-identical to the int8lin bundle on both verification prompts; oracle gate 16/16 + decode step, device numerics 24/24 ≡ Mac-GPU on all 3 runs.
- ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed.
- iPhone greedy sequences are 24/24 token-identical to the M4 Max GPU on both fixed verification prompts (both bundles).
- For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max; this 1.2B does 276.5 tok/s.
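
The bandwidth ceiling quoted in the ship-config note is plain arithmetic: a memory-bound decoder cannot emit tokens faster than memory bandwidth divided by the bytes it must read per token. A minimal sketch using the card's rounded inputs (~60 GB/s, ~1.27 GB/token); the function name is illustrative, and because the inputs are rounded the percentages land near, not exactly on, the quoted band.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Naive upper bound on decode speed when each token streams the weights once."""
    return bandwidth_gb_s / gb_per_token


# Rounded inputs taken from the card; treat them as approximate.
ceiling = bandwidth_ceiling_tok_s(60.0, 1.27)

# Measured iPhone decode range for the ship config (tok/s).
measured = (44.1, 46.6)
fractions = [m / ceiling for m in measured]

print(f"naive ceiling: {ceiling:.1f} tok/s")
for m, f in zip(measured, fractions):
    print(f"{m} tok/s is {f:.0%} of the ceiling")
```

The same division explains why shrinking the head helps: fewer bytes per token raises the ceiling itself, not just the fraction of it that is reached.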
What the bundle is
One full LanguageBundle (.aimodel + tokenizer/ + metadata.json): decode-only graph,
input_ids static [1,1], position_ids + KV seq dynamic (so the engine factory selects
coreai-pipelined: async non-blocking encode, on-GPU argmax sampling, on-device KV growth).
Weights are int8 linear per-block-32 (scale-multiply dequant, no LUT; k-means LUT
gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms,
and the four attention projections kept high-precision; in the ★★★ bundle the lm_head is
untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied).
Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on
the current beta GPU delegate (garbage logits from a delegate lowering bug, documented in the
zoo knowledge base). The attention projections
are fp32 on purpose: under a dynamic-shape graph the delegate's fp16 attention-prologue
matmuls lose ~1.3% relative accuracy, which LFM2.5's large q/k-norm gains amplify into wrong
logits; fp32 there restores layer-level exactness (+126 MB). Full write-up:
knowledge/pipelined-engine.md.
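
To make "absmax per-block-32 int8, symmetric, no clipping, scale-multiply dequant" concrete, here is a minimal pure-Python sketch. It assumes one scale per 32 consecutive weights in a row with scale = absmax / 127; the actual exporter's tensor layout and rounding may differ, so the conversion script remains the authority. Function names are illustrative.

```python
BLOCK = 32  # weights per scale, as in "per-block-32"


def quantize_row(row, block=BLOCK):
    """Symmetric absmax int8: one scale per block, no zero point, no clipping."""
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(w) for w in chunk)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        # The largest weight in the block maps to +/-127, so nothing is clipped.
        q.extend(round(w / scale) for w in chunk)
    return q, scales


def dequantize_row(q, scales, block=BLOCK):
    """Scale-multiply dequant: weight ~= int8 value * its block's scale (no LUT)."""
    return [v * scales[i // block] for i, v in enumerate(q)]


# Toy row of 64 weights (two blocks) with one outlier in the first block.
row = [((i * 37) % 101 - 50) / 400.0 for i in range(64)]
row[5] = 3.0
q, scales = quantize_row(row)
restored = dequantize_row(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(scales, max_err)
```

Because each block carries its own scale, the outlier coarsens only its own 32 weights; the second block keeps a much finer step.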
Run it
git clone https://github.com/john-rocky/coreai-model-zoo
git clone https://github.com/apple/coreai-models
git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \
../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch
# (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state)
# download this bundle into coreai-models/exports/, then:
cd coreai-models && swift build -c release
COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \
--model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3
Run contract (each of these matters):
- COREAI_CHUNK_THRESHOLD=1 before engine creation: prefill must run as pipelined S=1 steps (prompt tok/s ≈ decode tok/s).
- Never call engine.warmup() on this S=1 bundle (it warms query length 256, which the static [1,1] graph rejects). A 1-token generate after load is the warmup; llm-runner needs --warmup exact --warmup-length 1.
- Benchmark Release builds only (a Debug engine measures ~3× slower).
On iPhone, the CoreAIChat sample app has an LFM picker mode that downloads this repo in-app and chats through this bundle.
Conversion
Reproducible with
conversion/export_lfm2_decode_pipelined.py
(+ the models/macos/lfm2.py overlay) from the upstream HF checkpoint. Numerics are gated
the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the
fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode
step, not long-rollout eyeballing. Model card with the full method and the GPU-delegate
findings: zoo/lfm2.5.md.
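
The top-1 gate described above can be sketched as follows. This is an illustration, not the zoo's actual gate code: it assumes both models have already been run teacher-forced over the same prompt, so each position yields one logits vector per model, and it uses toy logits in place of real model output. Function names are hypothetical.

```python
def argmax(values):
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_gate(candidate_logits, reference_logits):
    """Return (matches, total): positions where both models pick the same top-1 token."""
    if len(candidate_logits) != len(reference_logits):
        raise ValueError("position counts differ")
    matches = sum(
        argmax(c) == argmax(r)
        for c, r in zip(candidate_logits, reference_logits)
    )
    return matches, len(reference_logits)


# Toy stand-in for a 16-position oracle prompt over an 8-token vocabulary.
reference = [[float((p * 3 + v * 5) % 8) for v in range(8)] for p in range(16)]
# Candidate = reference plus small perturbations, mimicking quantization noise.
candidate = [[x + 0.01 * ((v % 3) - 1) for v, x in enumerate(row)] for row in reference]

matches, total = top1_gate(candidate, reference)
passed = matches == total  # the gate requires every position to agree
print(f"{matches}/{total}", "PASS" if passed else "FAIL")
```

Checking every position independently is what makes the gate strict: a single early mismatch cannot hide behind a rollout that later drifts back to plausible text.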
License
The model weights derive from LiquidAI/LFM2.5-1.2B-Instruct and are redistributed under the LFM Open License v1.0 (see LICENSE): Apache-style grants, but Commercial Use is licensed only for entities below US$10M annual revenue (qualified non-profits exempt for non-commercial/research use). The conversion code is BSD-3-Clause (zoo repo).