	---
	license: mit
	language:
	- multilingual
	- en
	- vi
	- th
	library_name: coreml
	tags:
	- coreml
	- embeddings
	- sentence-similarity
	- retrieval
	- hark
	base_model: intfloat/multilingual-e5-small
	---

	# multilingual-e5-small — CoreML (int8) for Hark

	A CoreML (`mlprogram`, int8-weight-quantized) conversion of
	[`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small)
	(384-dim, multilingual), packaged for on-device vault search in
	[Hark](https://github.com/tuanda2912/hark) — a local-first, macOS-only meeting
	transcription app. Runs on the Apple Neural Engine; **nothing is sent off the
	machine** (Hark embeds the whole vault locally).

	This repo exists so Hark can download a ready-to-run CoreML artifact instead of
	shipping it in the app bundle. It is a faithful conversion — see Provenance and
	Validation — not a new model.

	## Files

	| File | What |
	|---|---|
	| `MultilingualE5Small.mlpackage/` | the CoreML model (int8 weights, ~113 MB) |
	| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | XLM-RoBERTa tokenizer (SentencePiece Unigram) |
	| `sentencepiece.bpe.model` | the SentencePiece model |

	Hark's loader snapshots this repo at a pinned revision into its app-support
	models directory, compiles the `.mlpackage` for the ANE, and runs fully offline
	thereafter.

	## I/O contract

	- inputs: `input_ids` (int32 `[1, L]`), `attention_mask` (int32 `[1, L]`), flexible `L ∈ 1..512`
	- output: `last_hidden_state` (float32 `[1, L, 384]`)
	- Hark applies masked mean-pooling + L2-normalization in Swift, and the e5
	asymmetric prefixes (`"query: "` / `"passage: "`). Reproduce those if you reuse
	this model directly.
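
For anyone reusing the model outside Hark, the post-processing described above can be sketched as follows. This is a dependency-free Python reference under stated assumptions, not Hark's Swift code: the function names `with_prefix` and `pool_and_normalize` are hypothetical, and the batch dimension of `last_hidden_state` is assumed to have been removed already.

```python
import math

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(text, is_query):
    """Prepend the e5 asymmetric prefix; apply this before tokenization."""
    return (QUERY_PREFIX if is_query else PASSAGE_PREFIX) + text


def pool_and_normalize(last_hidden_state, attention_mask):
    """Masked mean-pool [L, D] token vectors, then L2-normalize the result.

    last_hidden_state: list of L rows, each a list of D floats
    attention_mask: list of L ints, 1 for real tokens and 0 for padding
    """
    dim = len(last_hidden_state[0])
    summed = [0.0] * dim
    count = 0
    for row, keep in zip(last_hidden_state, attention_mask):
        if keep:  # skip padded positions entirely
            count += 1
            for i, value in enumerate(row):
                summed[i] += value
    count = max(count, 1)  # guard against an all-zero mask
    mean = [s / count for s in summed]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0  # avoid dividing by zero
    return [v / norm for v in mean]


# Toy input: two real tokens and one padded token, D = 2.
hidden = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = pool_and_normalize(hidden, mask)
```

The padded third row does not affect the result: the mean of the two real rows is `[1.5, 2.0]`, which has length 2.5, so the normalized embedding is approximately `[0.6, 0.8]`.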

	## Provenance

	- Converted from `intfloat/multilingual-e5-small` at source revision
	`614241f622f53c4eeff9890bdc4f31cfecc418b3` via
	[`engine/scripts/convert-embedder-coreml.py`](https://github.com/tuanda2912/hark/blob/main/engine/scripts/convert-embedder-coreml.py)
	(coremltools 9, `convert_to="mlprogram"`, `minimum_deployment_target=macOS14`).
	- int8 weight quantization (per-channel, symmetric) via
	[`engine/scripts/quantize-embedder-int8.py`](https://github.com/tuanda2912/hark/blob/main/engine/scripts/quantize-embedder-int8.py)
	(`coremltools.optimize.coreml.linear_quantize_weights`).
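
To make "per-channel, symmetric" concrete, the arithmetic can be illustrated in plain Python. This is a sketch of the general technique, not coremltools' implementation: it assumes one scale per output row, a zero point of 0, and the common `[-127, 127]` integer range, and the helper names are hypothetical.

```python
def quantize_per_channel_symmetric(weights):
    """Quantize an [out_channels, in_features] matrix to int8 values.

    Each row gets its own scale (max |w| / 127), so a row of small weights
    keeps its resolution even when another row holds large weights.
    """
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Symmetric: no zero-point offset, values clamped to [-127, 127].
        q_row = [max(-127, min(127, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Toy weights (made up for illustration): one large-magnitude row, one small.
weights = [[2.54, -1.0, 0.0], [0.0127, 0.0100, -0.0050]]
quantized, scales = quantize_per_channel_symmetric(weights)
restored = dequantize(quantized, scales)
```

With a single scale for the whole matrix, the second row would collapse to values near zero; with per-row scales both rows use the full int8 range, and every restored weight lands within half a quantization step of the original.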

	## Validation

	- Fidelity: the worst-case cosine similarity between the fp16 and int8
	pooled+L2-normalized embeddings was 0.99986 across EN/VI/TH probe sentences, so
	the int8 embeddings are essentially indistinguishable from fp16 for retrieval.
	- On-device: Hark's gated cross-lingual and end-to-end retrieval tests pass on
	the Apple Neural Engine with this int8 artifact (EN↔VI/TH pairs score closer
	than unrelated text; the full chunk → embed → index → retrieve pipeline runs).
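
A fidelity check of this kind can be sketched as below. The code is a minimal illustration, not Hark's test harness: the function names are hypothetical, and the vectors are toy values rather than real embeddings or measured results.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def worst_case_cosine(reference_embeddings, candidate_embeddings):
    """Minimum cosine over sentence-aligned pairs of embeddings.

    reference_embeddings[i] and candidate_embeddings[i] must embed the same
    probe sentence (for example, with fp16 and int8 weights respectively).
    """
    return min(
        cosine(ref, cand)
        for ref, cand in zip(reference_embeddings, candidate_embeddings)
    )


# Toy 2-dim vectors standing in for two probe sentences.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.6, 0.8]]
worst = worst_case_cosine(reference, candidate)
```

Here the first pair matches exactly and the second has cosine 0.8, so the reported worst case is 0.8. Reporting the minimum rather than the mean means a single badly degraded sentence cannot hide behind good averages.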

	## License

	MIT, inherited from `intfloat/multilingual-e5-small`. This is a format conversion
	and weight quantization of that model; all credit goes to the original authors.