---
license: apache-2.0
---

# Introduction

This repository hosts the [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2/tree/main) model for the [React Native ExecuTorch](https://www.npmjs.com/package/react-native-executorch) library. It includes the model exported for both the **XNNPACK** (Android / generic CPU) and **CoreML** (Apple) delegates, in multiple precisions, ready for use in the **ExecuTorch** runtime.

If you'd like to run these models in your own ExecuTorch runtime, refer to the [official documentation](https://pytorch.org/executorch/stable/index.html) for setup instructions.

## Compatibility

If you intend to use this model outside of React Native ExecuTorch, make sure your runtime is compatible with the **ExecuTorch** version used to export the `.pte` files. For more details, see the compatibility note in the [ExecuTorch GitHub repository](https://github.com/pytorch/executorch/blob/11d1742fdeddcf05bc30a6cfac321d2a2e3b6768/runtime/COMPATIBILITY.md?plain=1#L4). If you work with React Native ExecuTorch, the constants from the library will guarantee compatibility with the runtime used behind the scenes.

These models were exported using React Native ExecuTorch `v0.9.0`, which ships an ExecuTorch runtime derived from the `v1.2.0` release branch. **No forward compatibility** is guaranteed: older versions of the runtime may not work with these files.

## Variant Matrix

| Delegate | Precision | File | Size | RMSE vs eager | Notes |
|----------|-----------|------|------|---------------|-------|
| XNNPACK | fp32 | `xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_fp32.pte` | 516 MB | 0.0 | Baseline. Works on Android / iOS / generic CPU. |
| XNNPACK | 8da4w | `xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte` | 375 MB | 5.4e-4 | Int8 dynamic activation + Int4 weight (torchao), group_size=32. Embeddings stay fp32; the bulk of the size reduction comes from linear layers. |
| CoreML | fp32 | `coreml/distiluse-base-multilingual-cased-v2_coreml_fp32.pte` | 516 MB | 0.0 | Apple Neural Engine / GPU / CPU, float32 compute. |
| CoreML | fp16 | `coreml/distiluse-base-multilingual-cased-v2_coreml_fp16.pte` | 258 MB | 1.9e-4 | Half-sized via `compute_precision=FLOAT16` at CoreML compile. Cleanest size win on iOS. |

Pick the variant that matches your platform + size/quality trade-off. The CoreML variants only load on Apple platforms; the XNNPACK variants load everywhere.

## Repository Structure

- `xnnpack/`: `.pte` files partitioned for the XNNPACK delegate.
- `coreml/`: `.pte` files partitioned for the CoreML delegate (iOS / macOS only).
- `tokenizer.json`: HuggingFace fast-tokenizer dump (WordPiece + BertNormalizer). Wire this to `tokenizerSource`.
- `config.json`, `tokenizer_config.json`: upstream model/tokenizer configs, kept for reference and for non-RNE consumers.

The `.pte` path goes to `modelSource`; `tokenizer.json` is shared across all variants.

## Model details

- Architecture: DistilBERT multilingual cased + mean pooling + Dense (768→512, Tanh) + L2 norm.
- Output dimension: **512**.
- Max sequence length: **126** tokens (128 − 2 for `[CLS]` / `[SEP]`).
- Languages: 50+ (multilingual).
- Typical strength: cross-lingual sentence similarity and medium-length sentence retrieval. Short single-word queries in non-English languages are this model's weakest case; for those, longer sentences and/or English inputs give markedly better ranking.

## Export notes

The exported program skips HuggingFace's internal attention-mask-to-4D conversion because the RNE runtime never pads at inference (single sentence, no batching). This preserves bit-exactness with the PyTorch reference (RMSE 0 on fp32 random input) while trimming ~27% off the XNNPACK forward wall-time and keeping XNNPACK delegation around 89–91% of graph runtime.

Unsupported combinations (rejected by the exporter, documented for reference):

- **XNNPACK + fp16**: `model.to(torch.float16)` causes softmax / LayerNorm overflow and the runtime output is NaN. XNNPACK's size wins come from quantization, not fp16.
- **CoreML + 8da4w**: `coremltools` has no MIL mapping for the `torch.int8` tensors torchao emits (`KeyError: torch.int8`). The CoreML-native way to shrink further is `ct.optimize.coreml` palette/linear quantization, not torchao source transforms.
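
The variant matrix and the unsupported combinations above can be summarized as a small lookup helper. This is only a sketch: `variantPath`, `Delegate`, `Precision`, `SUPPORTED`, and the `isApplePlatform` flag are hypothetical names introduced for illustration and are not part of React Native ExecuTorch. The helper only rebuilds the repository-relative file names listed in the table.

```typescript
// Hypothetical helper, not part of React Native ExecuTorch.
// It mirrors the Variant Matrix and rejects combinations that were never exported.
type Delegate = "xnnpack" | "coreml";
type Precision = "fp32" | "fp16" | "8da4w";

const MODEL_NAME = "distiluse-base-multilingual-cased-v2";

// Precisions published for each delegate, as listed in the Variant Matrix.
const SUPPORTED: Record<Delegate, readonly Precision[]> = {
  xnnpack: ["fp32", "8da4w"],
  coreml: ["fp32", "fp16"],
};

// Returns the repository-relative path of the `.pte` file for a variant.
// `isApplePlatform` is an assumption: the caller decides how to detect iOS / macOS.
function variantPath(
  delegate: Delegate,
  precision: Precision,
  isApplePlatform: boolean
): string {
  if (delegate === "coreml" && !isApplePlatform) {
    throw new Error("CoreML variants only load on Apple platforms");
  }
  if (SUPPORTED[delegate].indexOf(precision) === -1) {
    throw new Error("Unsupported combination: " + delegate + " + " + precision);
  }
  return delegate + "/" + MODEL_NAME + "_" + delegate + "_" + precision + ".pte";
}
```

The returned string is relative to the repository root. Turning it into a URL or a bundled asset for `modelSource` is left to the application, and `tokenizer.json` stays the same for every variant.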