	---
	license: other
	license_name: lfm1.0
	license_link: LICENSE
	base_model: LiquidAI/LFM2.5-1.2B-Instruct
	tags:
	- coreai
	- aimodel
	- apple-silicon
	- on-device
	- lfm2
	- hybrid
	pipeline_tag: text-generation
	---

	# LFM2.5-1.2B-Instruct — Apple Core AI (`.aimodel`)

	LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI (the Core ML successor
	announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid
	(10 short-conv mixers + 6 GQA attention layers) riding Apple's **`coreai-pipelined` GPU
	engine** — the first non-Qwen architecture on that fast path, with zero custom kernels.

	> Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge
	> base, and the Swift runner: [coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo).

	<!-- gen-cards:use-it begin id=lfm2.5-1.2b (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
	(GUI + CLI, one app for every chat model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
	# → Run, then pick "LFM2.5 1.2B" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/ChatDemo
	swift run chat-cli --model lfm2.5-1.2b --prompt "What can you do, offline?"
	```

	💻 Build with it — complete; the glue is kit API, copy-paste runs:

	```swift
	import CoreAIKit

	let prompt = "What can you do, offline?"  // example input; substitute your own
	let chat = try await ChatSession(catalog: "lfm2.5-1.2b")
	let reply = try await chat.respond(to: prompt)
	// reply: the answer, generated fully on-device
	```

	The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift)
	— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
	the GUI drives the same `ChatSession` across turns for its transcript.
	Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn — it keeps the
	conversation history; `streamResponse(to:)` yields tokens as they decode.

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKit
	- Info.plist: none needed
	- Entitlements: none needed
	- First run downloads the model — 1.7 GB (Mac) / 1.7 GB (iPhone) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release — Debug is ~3× slower on per-token host work
	<!-- gen-cards:use-it end -->

	## Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle)

	| Surface | Bundle | Prefill (tok/s) | Decode (tok/s) |
	|---|---|---:|---:|
	| M4 Max, release `llm-benchmark` | ★★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/` (1.6 GB) | 277.8 | 276.5 |
	| iPhone 17 Pro, one-shot runner | ★★★ same bundle | 44.2–46.6 | 44.1–46.6 |
	| M4 Max, release `llm-benchmark` | ★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/` (1.5 GB) | 253.3 | 253.3 |
	| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 | 38.0–39.6 |
	| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 | 35.8 |

	- ★★★ = the ship config (`int8hu_block32_sym`): int8lin plus the lm_head untied from the
	embedding and quantized to absmax per-block-32 int8 (`symmetric`, no clipping, because
	clipping corrupts big-vocab heads). Throughput is +9% on M4 Max and +15–20% on iPhone
	relative to int8lin (44.1–46.6 tok/s is ~94–98% of the naive bandwidth ceiling,
	~60 GB/s ÷ ~1.27 GB/token); warm engine load is 0.3 s. Greedy rollouts are
	token-identical to the int8lin bundle on both verification prompts; the oracle gate
	passes 16/16 plus the decode step, and device numerics match the Mac GPU 24/24 on all 3 runs.
	- ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its
	ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed.
	- iPhone greedy sequences are 24/24 token-identical to the M4 Max GPU on both fixed
	verification prompts (both bundles).
	- For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max — this 1.2B does
	276.5.
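
	The ceiling percentages above are plain arithmetic: in this naive model, decode throughput
	cannot exceed memory bandwidth divided by the bytes read per token. A minimal sketch in
	Swift, assuming the rounded figures quoted above (~60 GB/s, ~1.27 GB/token); the helper
	name `percentOfCeiling` is illustrative and not part of any kit API.

	```swift
	import Foundation

	// Naive decode ceiling: tok/s cannot exceed memory bandwidth / bytes read per token.
	// Both inputs are the rounded figures quoted in the list above, so results are approximate.
	let bandwidthGBps = 60.0   // ~60 GB/s, as quoted
	let gbPerToken = 1.27      // ~1.27 GB read per decoded token, as quoted
	let ceilingTokPerSec = bandwidthGBps / gbPerToken

	/// Percentage of the naive ceiling reached by a measured decode rate.
	func percentOfCeiling(_ measuredTokPerSec: Double) -> Double {
	    100.0 * measuredTokPerSec / ceilingTokPerSec
	}

	print(String(format: "ceiling: %.1f tok/s", ceilingTokPerSec))
	for measured in [44.1, 46.6] {  // iPhone 17 Pro decode range from the table
	    print(String(format: "%.1f tok/s = %.0f%% of ceiling", measured, percentOfCeiling(measured)))
	}

	// Relative decode speedup of the ★★★ bundle over ★★ on M4 Max.
	let speedup = 276.5 / 253.3 - 1.0
	print(String(format: "M4 Max: +%.0f%%", speedup * 100))
	```

	With these rounded inputs the ceiling is about 47.2 tok/s and the iPhone range lands at
	roughly 93–99% of it, which agrees with the quoted ~94–98% to within the rounding of the
	two inputs.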

	## What the bundle is

	One full LanguageBundle (`.aimodel` + `tokenizer/` + `metadata.json`): decode-only graph,
	`input_ids` static `[1,1]`, position_ids + KV seq dynamic (→ the engine factory selects
	`coreai-pipelined`: async non-blocking encode, on-GPU argmax sampling, on-device KV growth).
	Weights are int8 linear per-block-32 (scale-multiply dequant — no LUT; k-means LUT
	gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms,
	and the four attention projections kept high-precision; in the ★★★ bundle the lm_head is
	untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied).
	Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on
	the current beta GPU delegate (garbage logits — delegate lowering bug, documented in the
	zoo knowledge base). The attention projections
	are fp32 on purpose: under a dynamic-shape graph the delegate's fp16 attention-prologue
	matmuls lose ~1.3% relative accuracy, which LFM2.5's large q/k-norm gains amplify into wrong
	logits — fp32 there restores layer-level exactness (+126 MB). Full write-up:
	[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
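
	To make the weight format concrete, here is a minimal sketch of symmetric absmax int8
	quantization over blocks of 32 values with scale-multiply dequantization, as described
	above. It is illustrative only: the function names are hypothetical, the data is a toy
	row, and the real conversion lives in the zoo's export script rather than in this code.

	```swift
	/// Symmetric absmax quantization: each block of `blockSize` weights shares one scale,
	/// chosen so the largest magnitude in the block maps to ±127 (outliers are not clipped).
	func quantizeAbsmax(_ weights: [Float], blockSize: Int = 32) -> (codes: [Int8], scales: [Float]) {
	    precondition(blockSize > 0 && weights.count % blockSize == 0,
	                 "weight count must be a multiple of blockSize")
	    var codes: [Int8] = []
	    var scales: [Float] = []
	    codes.reserveCapacity(weights.count)
	    for start in stride(from: 0, to: weights.count, by: blockSize) {
	        let block = weights[start..<(start + blockSize)]
	        let absmax = block.reduce(Float(0)) { max($0, abs($1)) }
	        let scale = absmax > 0 ? absmax / 127 : 1  // all-zero block: any scale works
	        scales.append(scale)
	        for w in block {
	            // The clamp only guards against float rounding; |w| <= absmax by construction.
	            let code = max(-127, min(127, (w / scale).rounded()))
	            codes.append(Int8(code))
	        }
	    }
	    return (codes, scales)
	}

	/// Scale-multiply dequantization: one multiply per weight, no lookup table.
	func dequantize(codes: [Int8], scales: [Float], blockSize: Int = 32) -> [Float] {
	    codes.indices.map { Float(codes[$0]) * scales[$0 / blockSize] }
	}

	// Toy row of 64 weights (two blocks of 32), not real model data.
	let toyWeights: [Float] = (0..<64).map { Float($0 - 32) * 0.01 }
	let quantized = quantizeAbsmax(toyWeights)
	let restored = dequantize(codes: quantized.codes, scales: quantized.scales)
	let maxError = zip(toyWeights, restored).map { abs($0.0 - $0.1) }.max() ?? 0
	print("scales:", quantized.scales, "max abs error:", maxError)
	```

	Because the scale comes from the true block maximum, the round-trip error of any weight is
	bounded by half its block's scale. A clipped scale would break that bound for the outliers.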

	## Run it

	```bash
	git clone https://github.com/john-rocky/coreai-model-zoo
	git clone https://github.com/apple/coreai-models
	git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \
	  ../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch
	# (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state)

	# download this bundle into coreai-models/exports/, then:
	cd coreai-models && swift build -c release
	COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \
	  --model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3
	```

	Run contract (each of these matters):
	- `COREAI_CHUNK_THRESHOLD=1` before engine creation — prefill must run as pipelined S=1
	steps (prompt tok/s ≈ decode tok/s).
	- Never call `engine.warmup()` on this S=1 bundle (it warms query length 256, which the
	static `[1,1]` graph rejects). Generating one token after load serves as the warmup;
	`llm-runner` needs `--warmup exact --warmup-length 1`.
	- Benchmark Release builds only (a Debug engine measures ~3× slower).
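
	One practical consequence of the S=1 prefill contract: prompt processing costs one
	pipelined step per prompt token, so time to first token grows linearly with prompt length.
	The sketch below estimates it in Swift from the measured rates in the table. It assumes
	`-p` and `-g` in the benchmark command are the prompt and generated token counts (an
	inference from the command, not confirmed here), and the helper name is illustrative.

	```swift
	import Foundation

	/// Estimated wall time when prefill runs as S=1 steps: one step per prompt token,
	/// then one step per generated token. Rates are in tokens per second.
	func estimatedSeconds(promptTokens: Int, generatedTokens: Int,
	                      prefillRate: Double, decodeRate: Double) -> (firstToken: Double, total: Double) {
	    let firstToken = Double(promptTokens) / prefillRate
	    let total = firstToken + Double(generatedTokens) / decodeRate
	    return (firstToken, total)
	}

	// Rates from the Measured table (★★★ bundle); the iPhone row uses the low end of its range.
	let surfaces: [(name: String, prefill: Double, decode: Double)] = [
	    ("M4 Max", 277.8, 276.5),
	    ("iPhone 17 Pro", 44.2, 44.1),
	]
	for surface in surfaces {
	    let t = estimatedSeconds(promptTokens: 128, generatedTokens: 256,
	                             prefillRate: surface.prefill, decodeRate: surface.decode)
	    let first = String(format: "%.1f", t.firstToken)
	    let total = String(format: "%.1f", t.total)
	    print("\(surface.name): first token after ~\(first) s, total ~\(total) s")
	}
	```

	For a 128-token prompt this works out to roughly 0.5 s before the first generated token on
	M4 Max and about 2.9 s on iPhone, so long system prompts are noticeably more expensive on
	the phone.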

	On iPhone, the [CoreAIChat sample app](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat)
	has an LFM picker mode that downloads this repo in-app and chats through this bundle.

	## Conversion

	Reproducible with
	[`conversion/export_lfm2_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_lfm2_decode_pipelined.py)
	(+ the `models/macos/lfm2.py` overlay) from the upstream HF checkpoint. Numerics are gated
	the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the
	fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode
	step — not long-rollout eyeballing. Model card with the full method and the GPU-delegate
	findings: [`zoo/lfm2.5.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/lfm2.5.md).
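
	The gate described above reduces to a per-position argmax comparison. A minimal sketch in
	Swift, assuming the logits of the converted model and of the fp32 reference have already
	been collected for each teacher-forced position; `top1Gate` and the toy logits are
	illustrative and are not the zoo's actual harness.

	```swift
	/// Index of the largest logit (the first one wins on ties).
	func argmax(_ logits: [Float]) -> Int {
	    precondition(!logits.isEmpty, "empty logits")
	    var best = 0
	    for i in logits.indices where logits[i] > logits[best] { best = i }
	    return best
	}

	/// Counts positions where the candidate's top-1 token equals the oracle's.
	func top1Gate(candidate: [[Float]], oracle: [[Float]]) -> (matched: Int, total: Int) {
	    precondition(candidate.count == oracle.count, "position counts differ")
	    let matched = zip(candidate, oracle).filter { argmax($0.0) == argmax($0.1) }.count
	    return (matched, oracle.count)
	}

	// Toy data: 16 positions, vocabulary of 4. The "candidate" is the oracle with a small
	// distortion standing in for quantization error.
	let vocab = 4
	let oracleLogits: [[Float]] = (0..<16).map { position in
	    (0..<vocab).map { token in token == position % vocab ? 5.0 : 0.0 }
	}
	let candidateLogits = oracleLogits.map { row in row.map { $0 * 0.98 + 0.05 } }

	let gate = top1Gate(candidate: candidateLogits, oracle: oracleLogits)
	print("top-1: \(gate.matched)/\(gate.total)")  // prints "top-1: 16/16"
	precondition(gate.matched == gate.total, "gate failed: conversion is not top-1 exact")
	```

	Requiring every position to match makes the check binary and position-local. A single
	mismatch fails the conversion outright rather than having to be spotted in a long rollout.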

	## License

	The model weights derive from
	[LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) and are
	redistributed under the LFM Open License v1.0 (see [LICENSE](LICENSE)): Apache-style
	grants, but Commercial Use is licensed only for entities below US$10M annual revenue
	(qualified non-profits exempt for non-commercial/research use). The conversion code is
	BSD-3-Clause (zoo repo).