license: mit
language:
  - multilingual
  - en
  - vi
  - th
library_name: coreml
tags:
  - coreml
  - embeddings
  - sentence-similarity
  - retrieval
  - hark
base_model: intfloat/multilingual-e5-small

multilingual-e5-small — CoreML (int8) for Hark

A CoreML (mlprogram, int8 weight-quantized) conversion of intfloat/multilingual-e5-small (384-dim, multilingual), packaged for on-device vault search in Hark, a local-first, macOS-only meeting transcription app. The model runs on the Apple Neural Engine and Hark embeds the whole vault locally, so nothing is sent off the machine.

This repo exists so Hark can download a ready-to-run CoreML artifact instead of shipping it in the app bundle. It is a faithful conversion — see Provenance and Validation — not a new model.

Files

| File | What |
| --- | --- |
| `MultilingualE5Small.mlpackage/` | the CoreML model (int8 weights, ~113 MB) |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | XLM-RoBERTa tokenizer (SentencePiece Unigram) |
| `sentencepiece.bpe.model` | the SentencePiece model |

Hark's loader snapshots this repo at a pinned revision into its app-support models directory, compiles the .mlpackage for the ANE, and runs fully offline thereafter.

I/O contract

inputs: input_ids (int32 [1, L]), attention_mask (int32 [1, L]), flexible L ∈ 1..512
output: last_hidden_state (float32 [1, L, 384])
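The model returns per-token hidden states, not a sentence embedding. Hark prepends the e5 asymmetric prefixes ("query: " / "passage: ") to the text before tokenization, then applies masked mean-pooling and L2 normalization to last_hidden_state in Swift. Reproduce those steps if you reuse this model directly.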
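For anyone reusing the model outside Hark, the post-processing described above can be sketched in plain Python. This is a minimal illustration, not Hark's Swift code: the function names (`with_prefix`, `masked_mean_pool`, `l2_normalize`, `embed_from_hidden`) are hypothetical, and the hidden states are assumed to arrive as a list of L rows of floats taken from `last_hidden_state[0]`.

```python
import math

QUERY_PREFIX = "query: "      # e5 prefix for search queries
PASSAGE_PREFIX = "passage: "  # e5 prefix for indexed text


def with_prefix(text, is_query):
    """Prepend the e5 asymmetric prefix; do this before tokenization."""
    return (QUERY_PREFIX if is_query else PASSAGE_PREFIX) + text


def masked_mean_pool(hidden, mask):
    """Average the token vectors whose attention_mask entry is 1.

    hidden: L rows of D floats (last_hidden_state[0]).
    mask:   L ints, 1 for real tokens and 0 for padding.
    """
    kept = sum(mask)
    if kept == 0:
        raise ValueError("attention_mask has no active tokens")
    pooled = [0.0] * len(hidden[0])
    for row, m in zip(hidden, mask):
        if m:
            for i, value in enumerate(row):
                pooled[i] += value
    return [value / kept for value in pooled]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def embed_from_hidden(hidden, mask):
    """Pool, then normalize, yielding one unit-length embedding."""
    return l2_normalize(masked_mean_pool(hidden, mask))
```

With unit-length embeddings, the dot product of a query vector and a passage vector equals their cosine similarity, which is why normalizing once at embedding time is convenient for an index.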

Provenance

Converted from intfloat/multilingual-e5-small at source revision 614241f622f53c4eeff9890bdc4f31cfecc418b3 via engine/scripts/convert-embedder-coreml.py (coremltools 9, convert_to="mlprogram", minimum_deployment_target=macOS14).
int8 weight quantization (per-channel, symmetric) via engine/scripts/quantize-embedder-int8.py (coremltools.optimize.coreml.linear_quantize_weights).
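To make "per-channel, symmetric" concrete, the arithmetic can be sketched in plain Python. This is an illustration of the general technique only, not the contents of quantize-embedder-int8.py or the coremltools implementation; the function names and the [-127, 127] clamp range are assumptions made for the sketch.

```python
def quantize_per_channel_symmetric(weights):
    """Quantize each output-channel row with its own scale.

    weights: list of rows of floats, one row per output channel.
    Returns (rows of ints in [-127, 127], per-row float scales).
    Symmetric means the zero point is fixed at 0, so only a scale is stored.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets an arbitrary scale of 1.0 to avoid dividing by 0.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Giving each channel its own scale keeps a row with small weights from being crushed by a large-magnitude row elsewhere in the same matrix; the rounding error per weight is bounded by half of that row's scale.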

Validation

Fidelity: the worst-case cosine similarity between the fp16 and int8 pooled, L2-normalized embeddings was 0.99986 across EN/VI/TH probe sentences, so the int8 embeddings are essentially indistinguishable from the fp16 ones for retrieval.
On-device: Hark's gated cross-lingual and end-to-end retrieval tests pass on the Apple Neural Engine with this int8 artifact (cross-lingual EN↔VI/TH pairs score closer than unrelated text; the full chunk → embed → index → retrieve pipeline runs).
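A fidelity check of the kind described above can be sketched as follows. This is a hypothetical reconstruction rather than the script that produced the 0.99986 figure: `cosine` and `worst_case_cosine` are illustrative names, and the inputs are assumed to be two equal-length lists of embeddings, one per probe sentence, from the fp16 and int8 models respectively.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def worst_case_cosine(reference, candidate):
    """Minimum cosine over paired embeddings of the same probe sentences.

    reference: embeddings from the baseline model (here, fp16).
    candidate: embeddings from the model under test (here, int8).
    """
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty lists of equal length")
    return min(cosine(r, c) for r, c in zip(reference, candidate))
```

Reporting the minimum rather than the mean means a single badly degraded sentence cannot hide behind many good ones.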

License

MIT, inherited from intfloat/multilingual-e5-small. This is a format conversion and weight quantization of that model; all credit to the original authors.