| --- |
| license: apache-2.0 |
| --- |
| |
| # Introduction |
|
|
| This repository hosts the [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/tree/main) model for the [React Native ExecuTorch](https://www.npmjs.com/package/react-native-executorch) library. It includes the model exported for both the **XNNPACK** (Android / generic CPU) and **CoreML** (Apple) delegates, in multiple precisions, ready for use in the **ExecuTorch** runtime. |
|
|
| If you'd like to run these models in your own ExecuTorch runtime, refer to the [official documentation](https://pytorch.org/executorch/stable/index.html) for setup instructions. |
|
|
| ## Compatibility |
|
|
| If you intend to use this model outside of React Native ExecuTorch, make sure your runtime is compatible with the **ExecuTorch** version used to export the `.pte` files. For more details, see the compatibility note in the [ExecuTorch GitHub repository](https://github.com/pytorch/executorch/blob/11d1742fdeddcf05bc30a6cfac321d2a2e3b6768/runtime/COMPATIBILITY.md?plain=1#L4). If you work with React Native ExecuTorch, the constants from the library will guarantee compatibility with the runtime used behind the scenes. |
|
|
| These models were exported using React Native ExecuTorch (RNE) `v0.9.0`, which ships an ExecuTorch runtime derived from the `v1.2.0` release branch and an updated `pytorch/extension/llm/tokenizers` build that adds Unigram / Precompiled normalizer / Metaspace decoder support, all of which are required to load this model's tokenizer. **No forward compatibility** is guaranteed: older versions of the runtime may not work with these files. In particular, RNE ≤ 0.8.x cannot load `tokenizer.json` and will fail at the tokenizer-load step. |
|
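As a rough illustration of the version constraint above, the check below treats `0.9.0` as the minimum React Native ExecuTorch version. This is a sketch only: `parse_version` and `can_load_tokenizer` are hypothetical helpers, not part of the library, and the assumption that every release from `0.9.0` onward can load the tokenizer goes beyond what is stated here.

```python
MIN_RNE_VERSION = (0, 9, 0)  # export version named in this card


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v0.9.0' or '0.9.0' into (0, 9, 0).

    Pre-release suffixes such as '-rc1' are not handled in this sketch.
    """
    return tuple(int(part) for part in version.lstrip("v").split("."))


def can_load_tokenizer(rne_version: str) -> bool:
    """Assumption: versions at or above 0.9.0 ship the updated tokenizers build."""
    return parse_version(rne_version) >= MIN_RNE_VERSION


if __name__ == "__main__":
    for candidate in ("0.8.3", "v0.9.0", "0.10.1"):
        print(candidate, can_load_tokenizer(candidate))
```

Comparing integer tuples rather than raw strings avoids ranking `0.10.1` below `0.9.0`.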
|
| ## Variant Matrix |
|
|
| | Delegate | Precision | File | Size | Notes | |
| |----------|-----------|-----------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| | XNNPACK | fp32 | `xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_fp32.pte` | 449 MB | Baseline. Works on Android / iOS / generic CPU. | |
| | XNNPACK | 8da4w | `xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte` | 379 MB | Int8 dynamic activation + Int4 weight (torchao), group_size=32. Embeddings stay fp32: the bulk of the file is the 250 037 × 384 vocab matrix (≈ 384 MB), so the linear-layer quantization yields only a modest size win. | |
| | CoreML | fp32 | `coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp32.pte` | 449 MB | Apple Neural Engine / GPU / CPU, float32 compute. | |
| | CoreML | fp16 | `coreml/paraphrase-multilingual-MiniLM-L12-v2_coreml_fp16.pte` | 225 MB | Half-sized via `compute_precision=FLOAT16` at CoreML compile. Cleanest size win on iOS. | |
|
|
| Pick the variant that matches your platform and your size/quality trade-off. The CoreML variants only load on Apple platforms; the XNNPACK variants load everywhere. |
|
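The embedding-size figure quoted in the 8da4w row can be reproduced with a few lines of arithmetic. The sketch assumes the vocabulary matrix is stored as plain fp32 (4 bytes per value) with no padding or container overhead. Whether the Size column uses decimal MB or binary MiB is not stated in this card, so both units are printed.

```python
VOCAB_SIZE = 250_037  # SentencePiece pieces in the vocabulary
HIDDEN_SIZE = 384     # embedding width
BYTES_PER_FP32 = 4    # assumption: embeddings are stored as raw float32


def embedding_table_bytes(vocab: int = VOCAB_SIZE,
                          hidden: int = HIDDEN_SIZE,
                          bytes_per_value: int = BYTES_PER_FP32) -> int:
    """Size in bytes of a dense vocab x hidden embedding matrix."""
    return vocab * hidden * bytes_per_value


if __name__ == "__main__":
    size = embedding_table_bytes()
    print(f"{size:,} bytes")
    print(f"{size / 1e6:.1f} MB (decimal)")
    print(f"{size / 2**20:.1f} MiB (binary)")
```

Because this matrix is left untouched by the 8da4w transform, only the remaining linear layers shrink, which is why the quantized file is not much smaller than the fp32 one.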
|
| ## Repository Structure |
|
|
| - `xnnpack/`: `.pte` files partitioned for the XNNPACK delegate. |
| - `coreml/`: `.pte` files partitioned for the CoreML delegate (iOS / macOS only). |
| - `tokenizer.json`: HuggingFace fast-tokenizer dump (Unigram model + Precompiled normalizer + Metaspace decoder, derived from the upstream SentencePiece tokenizer). Pass this as `tokenizerSource`. |
| - `config.json`, `tokenizer_config.json`: upstream model/tokenizer configs, kept for reference and for non-RNE consumers. |
|
|
| The `.pte` path goes to `modelSource`; `tokenizer.json` is shared across all variants. |
|
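The file layout is regular enough that the two sources can be derived from a delegate and a precision. The helper below is a hypothetical sketch, not part of React Native ExecuTorch: `resolve_sources` and its `on_apple` flag are invented for illustration, and it returns repository-relative paths only. How those paths become URLs or bundled assets in an app is outside its scope.

```python
MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"
TOKENIZER_FILE = "tokenizer.json"  # shared across all variants

# (delegate, precision) pairs listed in the variant matrix.
SUPPORTED_VARIANTS = {
    ("xnnpack", "fp32"),
    ("xnnpack", "8da4w"),
    ("coreml", "fp32"),
    ("coreml", "fp16"),
}


def resolve_sources(delegate: str, precision: str, on_apple: bool) -> dict[str, str]:
    """Return repository-relative paths keyed by the prop names used in this card."""
    key = (delegate.lower(), precision.lower())
    if key not in SUPPORTED_VARIANTS:
        raise ValueError(f"no exported variant for {delegate} + {precision}")
    if key[0] == "coreml" and not on_apple:
        raise ValueError("CoreML variants only load on Apple platforms")
    return {
        "modelSource": f"{key[0]}/{MODEL_NAME}_{key[0]}_{key[1]}.pte",
        "tokenizerSource": TOKENIZER_FILE,
    }


if __name__ == "__main__":
    print(resolve_sources("CoreML", "fp16", on_apple=True))
    print(resolve_sources("xnnpack", "8da4w", on_apple=False))
```

Rejecting unknown pairs up front surfaces the unsupported combinations (for example XNNPACK + fp16) as a clear error instead of a missing-file failure later.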
|
| ## Model details |
|
|
| - Architecture: 12-layer, 12-head BERT with hidden size 384 (initialized from `xlm-roberta-base`) + mean pooling + L2 norm. No additional dense projection head, so the model output dim equals the encoder hidden size. |
| - Output dimension: **384**. |
| - Max sequence length: **126** tokens (128 − 2 for the `<s>` / `</s>` wrapping; the exporter concatenates these XLM-R-style start/end tokens at id 0 / 2 inside the program). |
| - Vocabulary: 250 037 SentencePiece pieces. |
| - Languages: 50+ (multilingual). |
| - Typical strength: cross-lingual sentence similarity and medium-length sentence retrieval; the model is designed for paraphrase mining and cross-lingual search. Short single-word queries in non-English languages are this model's weakest case; longer sentences and/or English inputs give markedly better ranking. |
|
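The 126-token budget follows from the 128-token program length minus the two wrapper tokens. The sketch below shows that arithmetic on plain token-id lists. It assumes the caller is responsible for clipping; whether a given runtime truncates on its own is not stated here. `wrapped_as_program_sees_it` only mirrors the in-program wrapping for illustration and should not be applied before calling the model, since the program adds the tokens itself.

```python
MAX_PROGRAM_TOKENS = 128  # total length including <s> and </s>
BOS_ID = 0                # <s>
EOS_ID = 2                # </s>
MAX_INPUT_TOKENS = MAX_PROGRAM_TOKENS - 2  # 126 tokens left for the caller


def truncate_for_program(token_ids: list[int]) -> list[int]:
    """Clip caller-side ids to the 126-token budget."""
    return token_ids[:MAX_INPUT_TOKENS]


def wrapped_as_program_sees_it(token_ids: list[int]) -> list[int]:
    """Illustration only: the sequence after the in-program <s> ... </s> wrapping."""
    return [BOS_ID, *truncate_for_program(token_ids), EOS_ID]


if __name__ == "__main__":
    long_input = list(range(10, 510))  # 500 made-up token ids
    wrapped = wrapped_as_program_sees_it(long_input)
    print(len(truncate_for_program(long_input)), len(wrapped))
```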
|
| ## Export notes |
|
|
| The exporter wraps the HuggingFace transformer with the standard sentence-transformers contract: token IDs go in, the program prepends `<s>` and appends `</s>`, mean pooling is applied to the last hidden state weighted by the attention mask, and the output is L2-normalized to a 384-d vector. |
|
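The pooling and normalization steps described above can be sketched in plain Python. This is not the exporter's code: it works on nested lists rather than tensors, uses a toy width of 2 instead of 384, and the `eps` guard is an assumption added to avoid division by zero.

```python
import math


def mean_pool(hidden_states: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-zero mask
    return [s / count for s in summed]


def l2_normalize(vector: list[float], eps: float = 1e-12) -> list[float]:
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def dot(a: list[float], b: list[float]) -> float:
    """For unit-length vectors the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    hidden = [[3.0, 4.0], [1.0, 0.0], [9.0, 9.0]]  # last row is padding
    mask = [1, 1, 0]
    embedding = l2_normalize(mean_pool(hidden, mask))
    print(embedding, dot(embedding, embedding))
```

Because the output is already unit length, comparing two embeddings needs only a dot product.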
|
| Unsupported combinations (rejected by the exporter, documented for reference): |
|
|
| - **XNNPACK + fp16**: `model.to(torch.float16)` causes softmax / LayerNorm overflow and the runtime output is NaN. XNNPACK's size wins come from quantization, not fp16. |
| - **CoreML + 8da4w**: `coremltools` has no MIL mapping for the `torch.int8` tensors torchao emits (`KeyError: torch.int8`). The CoreML-native way to shrink further is `ct.optimize.coreml` palette/linear quantization, not torchao source transforms. |
|
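The fp16 failure is reported above without a detailed trace, so the following is only one plausible mechanism, shown with the standard library. Half precision tops out at 65504, which an exponential or a squared activation can exceed quickly. Python's `struct` raises on such values, whereas a tensor library would typically produce `inf`, and a later `inf - inf` then yields NaN.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x: float) -> bool:
    """Return True if the finite value x can be packed as a half-precision float."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    # Softmax-style exponentials: exp(11) still fits, exp(12) does not.
    print(fits_in_fp16(math.exp(11)), fits_in_fp16(math.exp(12)))
    # LayerNorm-style squares: an activation of 300 squares to 90000.
    print(fits_in_fp16(300.0 ** 2))
    # Once infinities appear, subtracting them gives NaN.
    print(math.isnan(float("inf") - float("inf")))
```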
|