---
license: mit
language:
- multilingual
- en
- vi
- th
library_name: coreml
tags:
- coreml
- embeddings
- sentence-similarity
- retrieval
- hark
base_model: intfloat/multilingual-e5-small
---
# multilingual-e5-small: CoreML (int8) for Hark
A **CoreML** (`mlprogram`, **int8-weight-quantized**) conversion of
[`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small)
(384-dim, multilingual), packaged for on-device vault search in
[**Hark**](https://github.com/tuanda2912/hark), a local-first, macOS-only meeting
transcription app. Runs on the Apple Neural Engine; **nothing is sent off the
machine** (Hark embeds the whole vault locally).

This repo exists so Hark can download a ready-to-run CoreML artifact instead of
shipping it in the app bundle. It is a faithful conversion (see *Provenance* and
*Validation*), not a new model.
## Files
| File | What |
|---|---|
| `MultilingualE5Small.mlpackage/` | the CoreML model (int8 weights, ~113 MB) |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | XLM-RoBERTa tokenizer (SentencePiece Unigram) |
| `sentencepiece.bpe.model` | the SentencePiece model |

Hark's loader snapshots this repo at a **pinned revision** into its app-support
models directory, compiles the `.mlpackage` for the Apple Neural Engine (ANE), and
runs fully offline thereafter.
## I/O contract
- **inputs:** `input_ids` (int32 `[1, L]`), `attention_mask` (int32 `[1, L]`), flexible `L ∈ 1..512`
- **output:** `last_hidden_state` (float32 `[1, L, 384]`)
- Hark applies **masked mean-pooling + L2-normalization** in Swift and prepends
the e5 asymmetric prefixes (`"query: "` / `"passage: "`) to the input text. Reproduce
both steps if you reuse this model directly.
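
To make the post-processing concrete, here is a minimal sketch in Python of the steps described above: prefixing, masked mean-pooling, and L2-normalization. It is not Hark's Swift code. The function names (`add_e5_prefix`, `masked_mean_pool`, `l2_normalize`, `embed_from_output`) are hypothetical, and plain lists stand in for the model's `[1, L, 384]` output with the batch axis dropped, so the example has no dependencies.

```python
import math


def add_e5_prefix(text, is_query):
    """Prepend the e5 asymmetric prefix: "query: " or "passage: "."""
    return ("query: " if is_query else "passage: ") + text


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding.

    last_hidden_state: L rows of D floats (batch axis already dropped).
    attention_mask:    L ints, each 0 or 1.
    """
    kept = [row for row, m in zip(last_hidden_state, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def embed_from_output(last_hidden_state, attention_mask):
    """Pool and normalize one sequence's hidden states into one embedding."""
    return l2_normalize(masked_mean_pool(last_hidden_state, attention_mask))
```

Because the embeddings are unit length, the dot product of a query embedding and a passage embedding is their cosine similarity. In practice a vectorized library would replace the list comprehensions.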
## Provenance
- Converted from `intfloat/multilingual-e5-small` at source revision
**`614241f622f53c4eeff9890bdc4f31cfecc418b3`** via
[`engine/scripts/convert-embedder-coreml.py`](https://github.com/tuanda2912/hark/blob/main/engine/scripts/convert-embedder-coreml.py)
(coremltools 9, `convert_to="mlprogram"`, `minimum_deployment_target=macOS14`).
- int8 weight quantization (per-channel, symmetric) via
[`engine/scripts/quantize-embedder-int8.py`](https://github.com/tuanda2912/hark/blob/main/engine/scripts/quantize-embedder-int8.py)
(`coremltools.optimize.coreml.linear_quantize_weights`).
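
For readers unfamiliar with the term, the following sketch shows what "per-channel, symmetric" int8 weight quantization means in general. It is an illustration of the scheme only, not the code in `quantize-embedder-int8.py` and not coremltools' internal implementation. The function names and the choice of the `[-127, 127]` integer range are assumptions made for the example.

```python
def quantize_per_channel_symmetric(weight):
    """Quantize a 2-D weight matrix to int8 values, one scale per row.

    weight: list of rows, one row per output channel.
    Returns (q_rows, scales). "Symmetric" means the zero point is fixed at 0,
    so each value is reconstructed as q * scale.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # An all-zero channel gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Reconstruct approximate float weights from int8 values and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Each channel keeps its own scale, so a channel with small weights is not forced onto the coarse grid needed by a channel with large ones. The rounding error per weight is at most half of that channel's scale.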
## Validation
- **Fidelity:** the worst-case cosine similarity between the fp16 and int8
pooled+L2-normalized embeddings was **0.99986** across EN/VI/TH probe sentences, so
the int8 weights are essentially indistinguishable from fp16 for retrieval.
- **On-device:** Hark's gated cross-lingual and end-to-end retrieval tests pass on
the Apple Neural Engine with this int8 artifact (EN↔VI/TH pairs score closer than
unrelated text; full chunk → embed → index → retrieve pipeline).
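
A fidelity check of the kind described above can be sketched as follows. This is an assumed shape for such a check, not Hark's actual test code. The function names are hypothetical, and the inputs are expected to be matching lists of embeddings produced from the same probe sentences by the fp16 and int8 models.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def worst_case_cosine(reference_embeddings, quantized_embeddings):
    """Return the lowest per-sentence cosine across paired embeddings."""
    if not reference_embeddings:
        raise ValueError("no embeddings to compare")
    if len(reference_embeddings) != len(quantized_embeddings):
        raise ValueError("embedding lists must have the same length")
    return min(
        cosine(ref, quant)
        for ref, quant in zip(reference_embeddings, quantized_embeddings)
    )
```

Reporting the minimum rather than the mean is the stricter choice: a single badly degraded sentence cannot be hidden by many good ones.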
## License
MIT, inherited from `intfloat/multilingual-e5-small`. This is a format conversion
plus weight quantization of that model; all credit to the original authors.