| --- |
| license: other |
| license_name: lfm1.0 |
| license_link: https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE |
| base_model: LiquidAI/LFM2.5-350M |
| tags: |
| - coreml |
| - ane |
| - lfm2 |
| - on-device |
| - iphone |
| language: |
| - en |
| - ja |
| library_name: coreml |
| pipeline_tag: text-generation |
| --- |
| |
| ## Use it from Swift |
|
|
| <!-- swift-usage-begin --> |
| ### Add the package |
|
|
| `Package.swift`: |
|
|
| ```swift |
| .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"), |
| |
| // In your target: |
| .product(name: "CoreMLLLM", package: "CoreML-LLM"), |
| ``` |
|
|
| Platforms: iOS 18+ / macOS 15+. |
|
|
| ### Download + chat (one call) |
|
|
| ```swift |
| import CoreMLLLM |
| |
| // First call pulls the bundle from this repo to Documents/Models/. |
| // Subsequent calls reuse the on-disk copy. |
| let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/lfm2.5-350m-coreml") |
| |
| let stream = try await llm.generate( |
|     [CoreMLLLM.Message(role: .user, content: "Hello!")], |
|     maxTokens: 256 |
| ) |
| for await chunk in stream { |
|     print(chunk, terminator: "") |
| } |
| ``` |
|
|
| Multi-turn: keep a `[CoreMLLLM.Message]` array, append each |
| user and assistant turn, and pass the whole history to |
| `generate(_:maxTokens:)` again. Call `llm.reset()` to start a new |
| conversation (clears the KV cache and conv state). |
| <!-- swift-usage-end --> |
|
|
| # LFM2.5 350M — CoreML build for Apple Neural Engine |
|
|
| CoreML port of [LiquidAI/LFM2.5-350M](https://huggingface.co/LiquidAI/LFM2.5-350M) |
| for the [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) iOS / macOS |
| runtime. fp16, 97.8 % ANE-resident, **52 tok/s decode on iPhone 17 Pro**. |
|
|
| ## Use it from Swift |
|
|
| ### 1. Add the package |
|
|
| `Package.swift`: |
|
|
| ```swift |
| .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"), |
| |
| // In your target: |
| .product(name: "CoreMLLLM", package: "CoreML-LLM"), |
| ``` |
|
|
| Platforms: iOS 18+ / macOS 15+. |
|
|
| ### 2. Download + chat (one-turn streaming) |
|
|
| ```swift |
| import CoreMLLLM |
| |
| let info = ModelDownloader.ModelInfo.lfm2_5_350m |
| let downloader = ModelDownloader.shared |
| |
| // First launch: pulls ~810 MB from this repo to |
| // Documents/Models/lfm2.5-350m/. Subsequent launches no-op. |
| if !downloader.isDownloaded(info) { |
|     _ = try await downloader.download(info) |
| } |
| |
| let modelDir = downloader.localModelURL(for: info)! |
|     .deletingLastPathComponent() // bundle root (parent of model.mlmodelc) |
| |
| let llm = try await CoreMLLLM.load(from: modelDir) |
| |
| let stream = try await llm.generate( |
|     [CoreMLLLM.Message(role: .user, content: "Hello!")], |
|     maxTokens: 256 |
| ) |
| for await chunk in stream { |
|     print(chunk, terminator: "") |
| } |
| ``` |
|
|
| ### 3. Multi-turn chat |
|
|
| ```swift |
| var history: [CoreMLLLM.Message] = [ |
|     .init(role: .system, content: "You are a concise assistant."), |
| ] |
| |
| func reply(to user: String) async throws -> String { |
|     history.append(.init(role: .user, content: user)) |
|     var out = "" |
|     let stream = try await llm.generate(history, maxTokens: 512) |
|     for await chunk in stream { |
|         out += chunk |
|         print(chunk, terminator: "") |
|     } |
|     history.append(.init(role: .assistant, content: out)) |
|     return out |
| } |
| |
| llm.reset() // start a fresh conversation (clears KV + conv state) |
| ``` |
|
|
| `CoreMLLLM.load()` honours the model's ChatML template, the |
| `<|im_end|>` / `<|endoftext|>` EOS tokens, and the conv-state I/O |
| contract automatically — you don't pass any of that yourself. |
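For orientation, here is a minimal sketch of what a ChatML prompt looks like once rendered. It is illustrative only: `ChatTurn` and `renderChatML` are hypothetical names that are not part of CoreML-LLM, and the layout (a single leading `<|startoftext|>` plus a trailing assistant header as the generation prompt) is an assumption based on the template described in the architecture notes, not a copy of the runtime's implementation.

```swift
// Hypothetical helper types; the CoreML-LLM runtime applies the template for you.
struct ChatTurn {
    let role: String      // "system", "user", or "assistant"
    let content: String
}

/// Renders turns in ChatML form.
/// Assumption: one leading `<|startoftext|>` and a trailing assistant header.
func renderChatML(_ turns: [ChatTurn]) -> String {
    var prompt = "<|startoftext|>"
    for turn in turns {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    // Open an assistant turn so the model continues from here.
    prompt += "<|im_start|>assistant\n"
    return prompt
}

let prompt = renderChatML([
    ChatTurn(role: "system", content: "You are a concise assistant."),
    ChatTurn(role: "user", content: "Hello!"),
])
print(prompt)
```

Generation then stops when the model emits one of the EOS tokens listed above, which is why the assistant turn is left open at the end of the prompt.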
|
|
| ### 4. Compute units |
|
|
| Defaults to `.cpuAndNeuralEngine` (the 52 tok/s number above). |
| Override at load time: |
|
|
| ```swift |
| let llm = try await CoreMLLLM.load( |
|     from: modelDir, |
|     computeUnits: .cpuOnly // or .all / .cpuAndGPU |
| ) |
| ``` |
|
|
| Or via env (only affects LFM2): |
|
|
| ```swift |
| import Foundation // provides setenv |
| setenv("LLM_LFM2_USE_CPU", "1", 1) |
| ``` |
|
|
| ## App: CoreMLLLMChat |
|
|
| If you just want to try it without writing code, the example app |
| ([Examples/CoreMLLLMChat](https://github.com/john-rocky/CoreML-LLM/tree/main/Examples/CoreMLLLMChat)) |
| ships an LFM2.5 350M (ANE) entry in its model picker — open the |
| project in Xcode, run on a device, tap **Download**. |
|
|
| ## Sideload (Mac → iPhone, no in-app download) |
|
|
| For development / offline use: |
|
|
| ```bash |
| DEVICE=$(xcrun devicectl list devices | awk '/connected/{print $3}' | head -1) |
| xcrun devicectl device copy to --device "$DEVICE" \ |
|   --domain-type appDataContainer \ |
|   --domain-identifier com.example.CoreMLLLMChat \ |
|   --source ./lfm2.5-350m-coreml \ |
|   --destination Documents/Models/lfm2.5-350m \ |
|   --remove-existing-content true |
| ``` |
|
|
| Note: `xcrun devicectl` writes files as UID 0 / 0755, which the app |
| sandbox can't unlink later, so the picker's trash button fails with |
| a permission error. To clear a sideloaded copy, run |
| [scripts/uninstall_sideloaded_model.sh](https://github.com/john-rocky/CoreML-LLM/blob/main/scripts/uninstall_sideloaded_model.sh) |
| from the host, or uninstall the app to wipe the container. |
|
|
| ## Files in this repo |
|
|
| ``` |
| model.mlmodelc/ compiled model — load via MLModel(contentsOf:) |
| model_config.json context_length, num_hidden_layers, lfm2_conv_l_pad … |
| hf_model/ tokenizer (ChatML, sanitised for swift-transformers) |
| ``` |
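If you want to inspect `model_config.json` yourself, a small `Decodable` type is enough. The sketch below decodes only the three keys named in the listing; `ModelConfig` and `loadModelConfig` are hypothetical names, the integer types are an assumption, and the sample values are placeholders rather than the contents of the real file.

```swift
import Foundation

// Hypothetical type covering only the keys named in the file listing.
// Fields are optional so a missing key does not fail decoding.
struct ModelConfig: Decodable {
    let contextLength: Int?
    let numHiddenLayers: Int?
    let lfm2ConvLPad: Int?

    enum CodingKeys: String, CodingKey {
        case contextLength = "context_length"
        case numHiddenLayers = "num_hidden_layers"
        case lfm2ConvLPad = "lfm2_conv_l_pad"
    }
}

func loadModelConfig(from data: Data) throws -> ModelConfig {
    try JSONDecoder().decode(ModelConfig.self, from: data)
}

// Placeholder payload; read the real numbers from the bundle's file.
let sample = Data("""
{"context_length": 2048, "num_hidden_layers": 16, "lfm2_conv_l_pad": 3}
""".utf8)
let config = try loadModelConfig(from: sample)
print(config.numHiddenLayers ?? -1, config.lfm2ConvLPad ?? -1)
```

Against a downloaded bundle you would pass `try Data(contentsOf: modelDir.appendingPathComponent("model_config.json"))` instead of the sample payload.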
|
|
| ## Architecture notes |
|
|
| * Hybrid: 6 attention layers (GQA + RoPE + QK-norm) + 10 short-conv |
| layers (depthwise causal Conv1d, kernel = 3). |
| * The conv-state rolling window is a regular **input/output tensor**, |
| not an MLState — the M-series ANE planner rejects the dual-state |
| combination (`kv_cache_0` + `conv_cache_0`) at predict-time |
| (`status=0x1d`). |
| * `L_pad = conv_L_cache = 3`. An earlier 16-wide padding fed enough |
| fp16 noise into the depthwise reduction that autoregressive output |
| collapsed to "kingkingking…" within a few tokens. Dropping the |
| padding fixed both correctness and ANE compatibility. |
| * Compute precision is the default fp16 — no fp32 fallback needed |
| once the padding is fixed. |
| * Chat template: ChatML (`<|im_start|>role\n…<|im_end|>\n`) wrapped |
| in `<|startoftext|>`. EOS = `<|im_end|>` (id 7) and `<|endoftext|>` |
| (id 2). |
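The state-in / state-out pattern used for the conv layers can be sketched in plain Swift. This is a toy model of one decode step of a depthwise causal Conv1d with a 3-wide rolling window, not the converted model's graph: all names are hypothetical, and the window layout (oldest sample first, newest sample included) is an assumption chosen for readability.

```swift
// Hypothetical names throughout; shows the pattern, not the real graph.
struct ConvStepResult {
    let output: [Float]      // one value per channel
    let state: [[Float]]     // updated rolling window, [channel][tap]
}

/// One decode step of a depthwise causal Conv1d.
/// - state:   previous window per channel, oldest sample first
/// - input:   new sample per channel
/// - weights: kernel taps per channel, aligned with the window order
func depthwiseConvStep(state: [[Float]], input: [Float], weights: [[Float]]) -> ConvStepResult {
    precondition(state.count == input.count && state.count == weights.count)
    var newState = state
    var output = [Float](repeating: 0, count: input.count)
    for c in input.indices {
        // Roll the window: drop the oldest sample, append the newest.
        newState[c].removeFirst()
        newState[c].append(input[c])
        // Depthwise: each channel only sees its own window and taps.
        for k in newState[c].indices {
            output[c] += weights[c][k] * newState[c][k]
        }
    }
    return ConvStepResult(output: output, state: newState)
}

// Two channels, kernel width 3, zero-initialised state.
var state: [[Float]] = [[0, 0, 0], [0, 0, 0]]
let weights: [[Float]] = [[1, 1, 1], [0, 0, 2]]
for x in [[1, 10], [2, 20], [3, 30]] as [[Float]] {
    let step = depthwiseConvStep(state: state, input: x, weights: weights)
    state = step.state   // the caller feeds the returned state into the next step
    print(step.output)
}
```

Because the window is an ordinary value that goes in and comes back out, the caller owns it between steps, which is the same contract the runtime follows when it passes the conv state as a regular tensor instead of an MLState.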
|
|
| Full conversion + drift writeup: |
| [docs/LFM2_CONVERSION_FINDINGS.md](https://github.com/john-rocky/CoreML-LLM/blob/main/docs/LFM2_CONVERSION_FINDINGS.md). |
|
|
| ## License |
|
|
| This CoreML port inherits **[LFM Open License v1.0](https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE)** from the [base model](https://huggingface.co/LiquidAI/LFM2.5-350M). |
|
|
| **Important — commercial use limit**: the LFM Open License grants free |
| commercial use only up to a **revenue threshold of US $10M / year**. |
| Above that threshold (and for non-501(c)(3) entities) you need a |
| separate commercial license from Liquid AI. See the upstream |
| [LICENSE](https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE) |
| and [Liquid AI commercial licensing](https://www.liquid.ai/) for |
| details. |
|
|
| The CoreML conversion code in this repo (the model class, conversion |
| scripts, runtime glue) is Apache 2.0 (parent project |
| [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM)). |
|
|