	---
	license: apache-2.0
	---

	# Introduction

	This repository hosts the [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/tree/main) model for the [React Native ExecuTorch](https://www.npmjs.com/package/react-native-executorch) library. It includes the model exported for both the XNNPACK (CPU; Android, iOS, and generic targets) and CoreML (Apple platforms only) delegates, in multiple precisions, ready for use in the ExecuTorch runtime.

	If you'd like to run these models in your own ExecuTorch runtime, refer to the [official documentation](https://pytorch.org/executorch/stable/index.html) for setup instructions.

	## Compatibility

	If you intend to use this model outside of React Native ExecuTorch, make sure your runtime is compatible with the ExecuTorch version used to export the `.pte` files. For more details, see the compatibility note in the [ExecuTorch GitHub repository](https://github.com/pytorch/executorch/blob/11d1742fdeddcf05bc30a6cfac321d2a2e3b6768/runtime/COMPATIBILITY.md?plain=1#L4). If you use React Native ExecuTorch, the constants shipped with the library guarantee compatibility with the runtime it uses behind the scenes.

	These models were exported using React Native ExecuTorch `v0.9.0`, which ships an ExecuTorch runtime derived from the `v1.2.0` release branch and an updated `pytorch/extension/llm/tokenizers` build. That build adds the Unigram model, Precompiled normalizer, and Metaspace decoder support required to load this model's tokenizer. Forward compatibility is not guaranteed, so older runtime versions may not work with these files. In particular, RNE ≤ 0.8.x cannot load `tokenizer.json` and fails at the tokenizer-load step.
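
The version constraint above can be expressed as a small guard. This is a minimal sketch, not part of the library: `parse_version` and `can_load_tokenizer` are hypothetical helpers, and treating `0.9.0` as the minimum is an assumption drawn only from the statement that RNE ≤ 0.8.x fails at the tokenizer-load step.

```python
# Hypothetical guard: the threshold is an assumption based on the note that
# React Native ExecuTorch <= 0.8.x cannot load tokenizer.json.
MIN_RNE_VERSION = (0, 9, 0)


def parse_version(text):
    """Turn 'v0.9.0' or '0.9.0-rc1' into a comparable tuple like (0, 9, 0)."""
    core = text.lstrip("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))


def can_load_tokenizer(rne_version):
    """Return True when the given RNE version is at or above the export version."""
    return parse_version(rne_version) >= MIN_RNE_VERSION


if __name__ == "__main__":
    for version in ("0.8.4", "v0.9.0", "0.10.1"):
        print(version, can_load_tokenizer(version))
```

Comparing tuples of integers rather than raw strings avoids the common mistake of ranking `0.10.x` below `0.9.x`.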

	## Variant Matrix

	| Delegate | Precision | File | Size | Notes |
	|----------|-----------|------|------|-------|
	| XNNPACK | fp32 | `xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_fp32.pte` | 449 MB | Baseline. Works on Android / iOS / generic CPU. |
	| XNNPACK | 8da4w | `xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte` | 379 MB | Int8 dynamic activation + Int4 weight (torchao), group_size=32. Embeddings stay fp32; the bulk of the file is the 250 037 × 384 vocab matrix (≈ 384 × 10⁶ bytes, about 366 MiB), so the linear-layer quantization yields only a modest size win. |
	| CoreML | fp32 | `coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp32.pte` | 449 MB | Apple Neural Engine / GPU / CPU, float32 compute. |
	| CoreML | fp16 | `coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp16.pte` | 225 MB | Half-sized via `compute_precision=FLOAT16` at CoreML compile. Cleanest size win on iOS. |

	Pick the variant that matches your platform and your size/quality trade-off. The CoreML variants only load on Apple platforms; the XNNPACK variants load everywhere.
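
The size figures in the table can be sanity-checked with a little arithmetic. The sketch below uses only numbers stated in the table (vocabulary size, hidden size, file sizes). Reading the Size column as binary units (MiB) is an assumption, made because an fp32 embedding matrix of about 384 × 10⁶ bytes would not otherwise fit in a 379 MB file.

```python
# Figures taken from the variant matrix; the variable names are illustrative.
VOCAB_SIZE = 250_037
HIDDEN_SIZE = 384
BYTES_PER_FP32 = 4
MIB = 1024 ** 2

# The embedding matrix stays in fp32 in every XNNPACK variant.
embedding_bytes = VOCAB_SIZE * HIDDEN_SIZE * BYTES_PER_FP32
embedding_mb = embedding_bytes / 1e6   # decimal megabytes
embedding_mib = embedding_bytes / MIB  # binary mebibytes

# Relative savings versus the 449 MB fp32 baseline, as listed in the table.
FP32_SIZE, Q8DA4W_SIZE, COREML_FP16_SIZE = 449, 379, 225
saving_8da4w_pct = (1 - Q8DA4W_SIZE / FP32_SIZE) * 100
saving_fp16_pct = (1 - COREML_FP16_SIZE / FP32_SIZE) * 100

if __name__ == "__main__":
    print(f"embedding matrix: {embedding_bytes} bytes "
          f"({embedding_mb:.1f} MB, {embedding_mib:.1f} MiB)")
    print(f"8da4w saving: {saving_8da4w_pct:.1f}%")
    print(f"CoreML fp16 saving: {saving_fp16_pct:.1f}%")
```

Because the embedding matrix alone accounts for most of the file and is left untouched by 8da4w, quantizing only the linear layers cannot shrink the file by much, whereas fp16 halves every tensor including the embeddings.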

	## Repository Structure

	- `xnnpack/` — `.pte` files partitioned for the XNNPACK delegate.
	- `coreml/` — `.pte` files partitioned for the CoreML delegate (iOS / macOS only).
	- `tokenizer.json` — HuggingFace fast-tokenizer dump (Unigram model + Precompiled normalizer + Metaspace decoder, derived from the upstream SentencePiece tokenizer). Wire this to `tokenizerSource`.
	- `config.json`, `tokenizer_config.json` — upstream model/tokenizer configs, kept for reference and for non-RNE consumers.

	The `.pte` path goes to `modelSource`; `tokenizer.json` is shared across all variants.
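
The mapping from platform and variant to the two sources can be sketched as a lookup table. This is an illustration only: `pick_sources` and the platform strings are hypothetical, the paths are the repository-relative ones listed in the variant matrix, and the returned keys simply mirror the `modelSource` / `tokenizerSource` names used above.

```python
# Repository-relative paths copied from the variant matrix.
VARIANTS = {
    ("xnnpack", "fp32"): "xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_fp32.pte",
    ("xnnpack", "8da4w"): "xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte",
    ("coreml", "fp32"): "coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp32.pte",
    ("coreml", "fp16"): "coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp16.pte",
}
TOKENIZER = "tokenizer.json"          # shared across all variants
APPLE_PLATFORMS = {"ios", "macos"}    # hypothetical platform labels


def pick_sources(platform, delegate, precision):
    """Return the model and tokenizer paths for one supported combination."""
    key = (delegate, precision)
    if key not in VARIANTS:
        raise ValueError(f"no exported variant for {delegate} + {precision}")
    if delegate == "coreml" and platform not in APPLE_PLATFORMS:
        raise ValueError("CoreML variants only load on Apple platforms")
    return {"modelSource": VARIANTS[key], "tokenizerSource": TOKENIZER}


if __name__ == "__main__":
    print(pick_sources("android", "xnnpack", "8da4w"))
    print(pick_sources("ios", "coreml", "fp16"))
```

Keeping the supported combinations in one dictionary means an unsupported pairing such as XNNPACK + fp16 fails early with a clear message instead of at load time.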

	## Model details

	- Architecture: 12-layer, 12-head BERT-style encoder with hidden size 384 and an XLM-R-style SentencePiece vocabulary, followed by mean pooling and L2 normalization. There is no additional dense projection head, so the model output dimension equals the encoder hidden size.
	- Output dimension: 384.
	- Max sequence length: 126 tokens (128 − 2 for the `<s>` / `</s>` wrapping; the exporter concatenates these XLM-R-style start/end tokens at id 0 / 2 inside the program).
	- Vocabulary: 250 037 SentencePiece pieces.
	- Languages: 50+ (multilingual).
	- Typical strength: cross-lingual sentence similarity and medium-length sentence retrieval — designed for paraphrase mining and cross-lingual search. Short single-word queries in non-English languages are this model's weakest case; longer sentences and/or English inputs give markedly better ranking.
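
The 126-token budget follows from the 128-position window minus the two wrapping tokens. The sketch below shows that arithmetic on plain lists of token IDs. The helper names are hypothetical, and whether the runtime truncates or rejects longer inputs is not stated here, so truncating on the caller's side is an assumption.

```python
MAX_SEQ_LEN = 128                       # total positions in the exported program
BOS_ID, EOS_ID = 0, 2                   # ids of <s> and </s> as stated above
MAX_CONTENT_TOKENS = MAX_SEQ_LEN - 2    # 126 tokens left for the text itself


def truncate_for_model(token_ids):
    """Keep at most 126 content tokens (caller-side truncation, an assumption)."""
    return list(token_ids[:MAX_CONTENT_TOKENS])


def wrap_like_exported_program(token_ids):
    """Mimic what the program does internally: <s> + content + </s>."""
    return [BOS_ID] + truncate_for_model(token_ids) + [EOS_ID]


if __name__ == "__main__":
    long_input = list(range(5, 505))    # 500 arbitrary token ids
    wrapped = wrap_like_exported_program(long_input)
    print(len(wrapped), wrapped[0], wrapped[-1])
```

Since the wrapping happens inside the program, callers pass only the content token IDs and should not add `<s>` / `</s>` themselves.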

	## Export notes

	The exporter wraps the HuggingFace transformer with the standard sentence-transformers contract: token IDs go in, the program prepends `<s>` and appends `</s>`, mean pooling is applied to the last hidden state weighted by the attention mask, and the output is L2-normalized to a 384-d vector.
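
The pooling contract described above can be written out in plain Python. This is a reference sketch on toy 3-dimensional vectors rather than the exporter's actual code; the function names are illustrative, and the real program operates on 384-dimensional hidden states.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    count = 0
    for vector, mask in zip(hidden_states, attention_mask):
        if mask:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # avoid division by zero on an all-padding input
    return [s / count for s in summed]


def l2_normalize(vector, eps=1e-12):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def cosine_similarity(a, b):
    """For unit-length vectors the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    hidden = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
    mask = [1, 1, 0]  # the last position is padding and is ignored
    embedding = l2_normalize(mean_pool(hidden, mask))
    print(embedding, cosine_similarity(embedding, embedding))
```

Because the output is already L2-normalized, comparing two embeddings needs only a dot product.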

	Unsupported combinations (rejected by the exporter, documented for reference):

	- XNNPACK + fp16: `model.to(torch.float16)` causes softmax / LayerNorm overflow, and the runtime output is NaN. XNNPACK's size wins come from quantization, not fp16.
	- CoreML + 8da4w: `coremltools` has no MIL mapping for the `torch.int8` tensors that torchao emits (`KeyError: torch.int8`). The CoreML-native way to shrink the model further is palettization or linear weight quantization via `ct.optimize.coreml`, not torchao source transforms.
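
The fp16 failure mode can be illustrated without any ML framework. The sketch below is a generic demonstration of the float16 range, not a trace of the actual export failure: it emulates half-precision rounding with the standard `struct` module and shows how a sum of squares (the kind of intermediate a LayerNorm variance needs) overflows to infinity and then produces NaN. The activation values are made up for illustration.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round a Python float to half precision, overflowing to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; hardware would produce infinity.
        return math.copysign(math.inf, x)


def sum_of_squares_fp16(values):
    """Accumulate squares while rounding every intermediate to half precision."""
    total = 0.0
    for value in values:
        square = to_fp16(to_fp16(value) * to_fp16(value))
        total = to_fp16(total + square)
    return total


if __name__ == "__main__":
    activations = [300.0, -280.0, 150.0]   # illustrative values only
    total = sum_of_squares_fp16(activations)
    print(total)           # 300 squared already exceeds FP16_MAX
    print(total - total)   # inf - inf is NaN, which then propagates
```

Once a single intermediate becomes infinite, later subtractions or divisions yield NaN, which is consistent with the NaN output reported for the XNNPACK + fp16 combination.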