	---
	language:
	- en
	library_name: coreml
	pipeline_tag: text-generation
	tags:
	- lfm
	- liquid
	- coreml
	- apple-neural-engine
	- ane
	- on-device
	---

	# LFM 2.5 1.2B Instruct - Core ML (ANE)

	This is an experimental Core ML export of [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct), structured for the Apple Neural Engine (ANE) using Core ML's stateful model API (`MLState`, available on iOS 18+ / macOS 15+).

	## Model Details
	- Architecture: Liquid Foundation Model (LFM) - LIV Convolution + Full Attention Hybrid
	- Size: 1.2B Parameters
	- Quantization: 4-bit Linear Symmetric (INT4 weights)
	- Target Runtime: Core ML / Apple Neural Engine (iOS 18+ / macOS 15+)
	- Cache Handling: Native `MLState` (Stateful Core ML) with fixed sequence length bounds.
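
As a rough sanity check on the figures above, the weight footprint can be estimated from the parameter count and bit width. This is only a back-of-the-envelope sketch: it ignores quantization scales, any layers left unquantized, the cache buffers, and activations, so the real on-device footprint will be higher.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 1.2e9  # "1.2B Parameters" from the list above

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{label}: ~{gb:.1f} GB of weights")
```

Under these assumptions, INT4 weights come to roughly 0.6 GB, versus about 2.4 GB at FP16.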

	## Integration & Export Details
	This model has been adapted from its original PyTorch format because the native `LIV Convolution` state management concatenates cache tensors at every step, so their shapes grow over time. That operation is incompatible with the ANE's static memory requirements.

	To solve this, the export pipeline applied the following transformations:
	1. Static Buffer Allocation: The rolling `conv_cache` and standard attention `key_value` caches are allocated to fixed bounds (e.g. `MAX_SEQ_LEN = 512`) at initialization.
	2. In-Place Updates: Dynamic slice concatenation was monkey-patched to use in-place slice assignment (`tensor[:] = ...` and `tensor[:, :, cache_position, :] = ...`).
	3. Core ML State Mapping: These buffers are declared as `ct.StateType` states during `coremltools` conversion (coremltools 8.0 or later), so the Swift runtime can handle them efficiently as opaque `MLState` handles.
	4. INT4 Quantization: The linear layers have been quantized to 4-bit to fit within strict iOS Jetsam limits on 8GB devices.
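
Steps 1 and 2 can be illustrated without PyTorch. The following is a minimal sketch in plain Python, where nested lists stand in for tensors. The sizes `CONV_WINDOW` and `HEAD_DIM` and all function names are hypothetical, and `MAX_SEQ_LEN` is shrunk for readability; this is not the actual export code.

```python
MAX_SEQ_LEN = 8   # the card's example bound is 512; kept small here
CONV_WINDOW = 3   # hypothetical rolling-window length for the conv cache
HEAD_DIM = 4      # hypothetical per-position feature size


def make_caches():
    """Allocate both caches once, at their final size."""
    conv_cache = [[0.0] * HEAD_DIM for _ in range(CONV_WINDOW)]
    kv_cache = [[0.0] * HEAD_DIM for _ in range(MAX_SEQ_LEN)]
    return conv_cache, kv_cache


def update_conv_cache(conv_cache, new_row):
    """Roll the window in place: drop the oldest row, append the newest."""
    conv_cache[:] = conv_cache[1:] + [list(new_row)]


def update_kv_cache(kv_cache, cache_position, new_row):
    """Write one position in place instead of concatenating a new row."""
    if not 0 <= cache_position < MAX_SEQ_LEN:
        raise IndexError("cache_position exceeds the fixed sequence bound")
    kv_cache[cache_position][:] = new_row


conv_cache, kv_cache = make_caches()
for position in range(5):
    row = [float(position)] * HEAD_DIM
    update_conv_cache(conv_cache, row)
    update_kv_cache(kv_cache, position, row)

# Buffer sizes never change, no matter how many tokens were processed.
print(len(conv_cache), len(kv_cache))
```

The point of the sketch is that both buffers keep the size they were allocated with: the rolling cache overwrites its contents via slice assignment, and the key/value cache is written at an explicit position rather than grown.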

	## Usage in Swift
	This model must be invoked using `MLState` instead of passing the caches explicitly:

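	```swift
	import CoreML

	let config = MLModelConfiguration()
	// .cpuAndGPU is the conservative choice. .all or .cpuAndNeuralEngine lets Core ML
	// schedule work on the ANE, though ANE compile success may vary by iOS patch.
	config.computeUnits = .cpuAndGPU
	// Load asynchronously through the generated class's load(configuration:) method.
	let model = try await LFM2_5_1_2B_Stateful.load(configuration: config)

	// Create the state once and reuse it for every step so the caches persist.
	let state = model.makeState()

	// One step of the token generation loop
	let input = LFM2_5_1_2B_StatefulInput(
	    input_ids: currentTokenArray,
	    cache_position: cachePositionArray,
	    attention_mask: attentionMaskArray
	)

	let output = try await model.prediction(input: input, using: state)
	```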
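
The Swift snippet leaves the three input arrays undefined. As a language-neutral illustration, the sketch below (in Python) shows one plausible way the per-step values could be derived when the caches have a fixed bound. The single-token decoding, the mask convention (1 for filled slots, 0 for unused ones), and the function name `step_inputs` are assumptions for illustration, not something confirmed by this repository; the model's actual input shapes should be checked before relying on it.

```python
MAX_SEQ_LEN = 512  # example bound quoted earlier in this card


def step_inputs(token_id: int, position: int, max_seq_len: int = MAX_SEQ_LEN):
    """Build hypothetical single-token inputs for one decode step."""
    if not 0 <= position < max_seq_len:
        raise ValueError("position is outside the fixed cache bounds")
    # Assumed convention: 1 marks cache slots that are filled, 0 marks unused slots.
    attention_mask = [1] * (position + 1) + [0] * (max_seq_len - position - 1)
    return {
        "input_ids": [token_id],
        "cache_position": [position],
        "attention_mask": attention_mask,
    }


inputs = step_inputs(token_id=42, position=3, max_seq_len=8)
print(inputs["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Because the bound is fixed, a generation loop built this way has to stop (or reset the state) once `position` reaches the maximum sequence length.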

	## Intended Use
	This repository was compiled for use inside [iMLX](https://github.com/alan13367/iMLX) (an experimental local inference chat app for iOS). It includes the original Hugging Face `tokenizer.json` and a specific `model_config.json` designed for the app's `ModelDownloadService`.