---
license: other
license_name: lfm1.0
license_link: https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE
base_model: LiquidAI/LFM2.5-350M
tags:
- coreml
- ane
- lfm2
- on-device
- iphone
language:
- en
- ja
library_name: coreml
pipeline_tag: text-generation
---

## Use it from Swift

<!-- swift-usage-begin -->
### Add the package

`Package.swift`:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```

Platforms: iOS 18+ / macOS 15+.

### Download + chat (one call)

```swift
import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/lfm2.5-350m-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```

Multi-turn: keep an `[CoreMLLLM.Message]` array, append the
user/assistant turns, and pass the whole history to
`generate(_:)` again.  Call `llm.reset()` to start a new
conversation (clears the KV cache).
<!-- swift-usage-end -->



# LFM2.5 350M — CoreML build for Apple Neural Engine

CoreML port of [LiquidAI/LFM2.5-350M](https://huggingface.co/LiquidAI/LFM2.5-350M)
for the [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) iOS / macOS
runtime. The model runs in fp16, is 97.8% ANE-resident, and decodes at **52 tok/s on iPhone 17 Pro**.

## Use it from Swift

### 1. Add the package

`Package.swift`:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```

Platforms: iOS 18+ / macOS 15+.

### 2. Download + chat (one-turn streaming)

```swift
import CoreMLLLM

let info = ModelDownloader.ModelInfo.lfm2_5_350m
let downloader = ModelDownloader.shared

// First launch: pulls ~810 MB from this repo to
// Documents/Models/lfm2.5-350m/.  Subsequent launches no-op.
if !downloader.isDownloaded(info) {
    _ = try await downloader.download(info)
}

let modelDir = downloader.localModelURL(for: info)!
    .deletingLastPathComponent()  // bundle root (parent of model.mlmodelc)

let llm = try await CoreMLLLM.load(from: modelDir)

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```

### 3. Multi-turn chat

```swift
var history: [CoreMLLLM.Message] = [
    .init(role: .system, content: "You are a concise assistant."),
]

func reply(to user: String) async throws -> String {
    history.append(.init(role: .user, content: user))
    var out = ""
    let stream = try await llm.generate(history, maxTokens: 512)
    for await chunk in stream {
        out += chunk
        print(chunk, terminator: "")
    }
    history.append(.init(role: .assistant, content: out))
    return out
}

llm.reset()  // start a fresh conversation (clears KV + conv state)
```

`CoreMLLLM.load()` honours the model's ChatML template, the
`<|im_end|>` / `<|endoftext|>` EOS tokens, and the conv-state I/O
contract automatically — you don't pass any of that yourself.
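
For reference, the prompt layout the runtime builds on your behalf can be sketched as below. This is a minimal illustration, not the library's implementation: `Role`, `ChatMessage`, and `renderChatML` are hypothetical stand-ins for `CoreMLLLM.Message` and its internals, `<|startoftext|>` is assumed to be a single leading token, and the trailing open assistant turn is the usual ChatML generation prompt rather than something this card confirms.

```swift
// Hypothetical stand-in types; the real runtime uses CoreMLLLM.Message.
enum Role: String { case system, user, assistant }

struct ChatMessage {
    let role: Role
    let content: String
}

/// Renders messages as ChatML, prefixed with <|startoftext|> (assumed placement).
/// When addGenerationPrompt is true, an open assistant turn is appended
/// so the model continues from there (assumption, standard ChatML practice).
func renderChatML(_ messages: [ChatMessage], addGenerationPrompt: Bool = true) -> String {
    var prompt = "<|startoftext|>"
    for message in messages {
        prompt += "<|im_start|>\(message.role.rawValue)\n\(message.content)<|im_end|>\n"
    }
    if addGenerationPrompt {
        prompt += "<|im_start|>assistant\n"
    }
    return prompt
}

let prompt = renderChatML([
    ChatMessage(role: .system, content: "You are a concise assistant."),
    ChatMessage(role: .user, content: "Hello!"),
])
print(prompt)
```

Generation would then stop at the first `<|im_end|>` or `<|endoftext|>` token, which is the EOS handling described above.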

### 4. Compute units

Defaults to `.cpuAndNeuralEngine` (the 52 tok/s number above).
Override at load time:

```swift
let llm = try await CoreMLLLM.load(
    from: modelDir,
    computeUnits: .cpuOnly  // or .all / .cpuAndGPU
)
```

Or set an environment variable (only affects LFM2):

```swift
setenv("LLM_LFM2_USE_CPU", "1", 1)
```

## App: CoreMLLLMChat

If you just want to try it without writing code, the example app
([Examples/CoreMLLLMChat](https://github.com/john-rocky/CoreML-LLM/tree/main/Examples/CoreMLLLMChat))
ships an LFM2.5 350M (ANE) entry in its model picker — open the
project in Xcode, run on a device, tap **Download**.

## Sideload (Mac → iPhone, no in-app download)

For development / offline use:

```bash
DEVICE=$(xcrun devicectl list devices | awk '/connected/{print $3}' | head -1)
xcrun devicectl device copy to --device "$DEVICE" \
    --domain-type appDataContainer \
    --domain-identifier com.example.CoreMLLLMChat \
    --source ./lfm2.5-350m-coreml \
    --destination Documents/Models/lfm2.5-350m \
    --remove-existing-content true
```

Note: `xcrun devicectl` writes files as UID 0 / 0755, which the app
sandbox can't unlink later, so the picker's trash button will fail with
a permission error. To clear a sideloaded copy, run
[scripts/uninstall_sideloaded_model.sh](https://github.com/john-rocky/CoreML-LLM/blob/main/scripts/uninstall_sideloaded_model.sh)
from the host, or uninstall the app to wipe the container.
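
If you build your own picker, it may help to surface that failure instead of letting the delete silently do nothing. The sketch below is one way to do it using only Foundation's `FileManager`; `ModelRemoval` and `removeModelDirectory(at:)` are hypothetical names, not part of CoreML-LLM, and the demo path is a placeholder.

```swift
import Foundation

/// Outcome of trying to delete a model directory (hypothetical helper type).
enum ModelRemoval: Equatable {
    case removed
    case notFound
    case failed(String)   // carries the system error text, e.g. a permission error
}

/// Attempts to delete the directory and reports why it could not,
/// so the UI can point the user at the host-side cleanup instead.
func removeModelDirectory(at url: URL) -> ModelRemoval {
    let fileManager = FileManager.default
    guard fileManager.fileExists(atPath: url.path) else { return .notFound }
    do {
        try fileManager.removeItem(at: url)
        return .removed
    } catch {
        return .failed(error.localizedDescription)
    }
}

// Placeholder path for demonstration; a real app would pass its
// Documents/Models/lfm2.5-350m directory.
let demoDir = FileManager.default.temporaryDirectory
    .appendingPathComponent("lfm2.5-350m-demo-missing")
print(removeModelDirectory(at: demoDir))
```

On a `.failed` result for a sideloaded copy, the app can show the two options above: the host-side script or reinstalling the app.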

## Files in this repo

```
model.mlmodelc/      compiled model — load via MLModel(contentsOf:)
model_config.json    context_length, num_hidden_layers, lfm2_conv_l_pad …
hf_model/            tokenizer (ChatML, sanitised for swift-transformers)
```
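
To inspect `model_config.json` yourself, a small Foundation-only reader is enough. The sketch below assumes the three keys named above hold integers; `readConfigInts` is a hypothetical helper, and the `context_length` value in the sample is a placeholder rather than the shipped configuration.

```swift
import Foundation

/// Reads the requested integer fields from model_config.json data.
/// Keys that are missing or not integers are skipped rather than treated as errors.
func readConfigInts(from data: Data, keys: [String]) throws -> [String: Int] {
    let object = try JSONSerialization.jsonObject(with: data)
    guard let dictionary = object as? [String: Any] else { return [:] }
    var result: [String: Int] = [:]
    for key in keys {
        if let value = dictionary[key] as? Int {
            result[key] = value
        }
    }
    return result
}

// Sample document for illustration only; context_length is a placeholder value.
let sample = Data(#"{"context_length": 1024, "num_hidden_layers": 16, "lfm2_conv_l_pad": 3}"#.utf8)
let fields = try readConfigInts(
    from: sample,
    keys: ["context_length", "num_hidden_layers", "lfm2_conv_l_pad"]
)
for key in fields.keys.sorted() {
    print(key, fields[key]!)
}
```

In an app, replace `sample` with `try Data(contentsOf:)` on the file inside the downloaded bundle directory.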

## Architecture notes

* Hybrid: 6 attention layers (GQA + RoPE + QK-norm) + 10 short-conv
  layers (depthwise causal Conv1d, kernel = 3).
* The conv-state rolling window is a regular **input/output tensor**,
  not an MLState — the M-series ANE planner rejects the dual-state
  combination (`kv_cache_0` + `conv_cache_0`) at predict-time
  (`status=0x1d`).
* `L_pad = conv_L_cache = 3`. An earlier 16-wide padding fed enough
  fp16 noise into the depthwise reduction that autoregressive output
  collapsed to "kingkingking…" within a few tokens. Dropping the
  padding fixed both correctness and ANE compatibility.
* Compute precision is the default fp16 — no fp32 fallback needed
  once the padding is fixed.
* Chat template: ChatML (`<|im_start|>role\n…<|im_end|>\n`) wrapped
  in `<|startoftext|>`. EOS = `<|im_end|>` (id 7) and `<|endoftext|>`
  (id 2).
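
The conv-state contract in the second bullet can be pictured with a single-channel toy version: the caller passes the rolling window in, gets the updated window back, and feeds it into the next step. This is an illustrative sketch only. `convStep` is a hypothetical function, the taps are made-up numbers, it uses `Float` rather than fp16, it assumes the window includes the current sample, and it omits the per-channel weights and any gating the real layers apply.

```swift
/// One decode step of a single-channel causal convolution with explicit state.
/// `state` holds the most recent inputs, oldest first. The updated window is
/// returned with the output so the caller can pass it to the next step,
/// mirroring a state tensor that is a plain model input and output.
func convStep(input x: Float, state: [Float], weights: [Float]) -> (output: Float, newState: [Float]) {
    precondition(state.count == weights.count, "state and kernel must have equal length")
    // Roll the window: drop the oldest sample, append the newest.
    let newState = Array(state.dropFirst()) + [x]
    // Output is the dot product of the window with the kernel taps.
    let y = zip(newState, weights).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    return (y, newState)
}

var state = [Float](repeating: 0, count: 3)   // zero-initialised window of length 3
let weights: [Float] = [0.25, 0.5, 1.0]       // illustrative taps, not model weights
var outputs: [Float] = []
for x in [1, 2, 3, 4] as [Float] {
    let step = convStep(input: x, state: state, weights: weights)
    state = step.newState                      // feed the returned window back in
    outputs.append(step.output)
}
print(outputs, state)
```

Resetting a conversation in this picture amounts to zeroing the window again, alongside clearing the KV cache.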

Full conversion + drift writeup:
[docs/LFM2_CONVERSION_FINDINGS.md](https://github.com/john-rocky/CoreML-LLM/blob/main/docs/LFM2_CONVERSION_FINDINGS.md).

## License

This CoreML port inherits **[LFM Open License v1.0](https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE)** from the [base model](https://huggingface.co/LiquidAI/LFM2.5-350M).

**Important — commercial use limit**: the LFM Open License grants free
commercial use only up to a **revenue threshold of US $10M / year**.
Above that threshold (and for non-501(c)(3) entities) you need a
separate commercial license from Liquid AI. See the upstream
[LICENSE](https://huggingface.co/LiquidAI/LFM2.5-350M/blob/main/LICENSE)
and [Liquid AI commercial licensing](https://www.liquid.ai/) for
details.

The CoreML conversion code in this repo (the model class, conversion
scripts, runtime glue) is Apache 2.0 (parent project
[CoreML-LLM](https://github.com/john-rocky/CoreML-LLM)).