Introduction

This repository hosts the distiluse-base-multilingual-cased-v2 model for the React Native ExecuTorch library. It includes the model exported for both the XNNPACK (Android / generic CPU) and CoreML (Apple) delegates, in multiple precisions, ready for use in the ExecuTorch runtime.

If you'd like to run these models in your own ExecuTorch runtime, refer to the official documentation for setup instructions.

Compatibility

If you intend to use this model outside of React Native ExecuTorch, make sure your runtime is compatible with the ExecuTorch version used to export the .pte files. For details, see the compatibility note in the ExecuTorch GitHub repository. If you use React Native ExecuTorch, the model constants exported by the library are guaranteed to match the runtime it ships behind the scenes.

These models were exported using React Native ExecuTorch v0.9.0, which ships an ExecuTorch runtime derived from the v1.2.0 release branch. No forward compatibility is guaranteed: older versions of the runtime may not work with these files.
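
For illustration, a minimal sketch of the constant-based setup. The useTextEmbeddings hook comes from the React Native ExecuTorch API, but the constant names below are assumptions; check the library's documentation for the exact exports.

```typescript
import {
  useTextEmbeddings,
  // ASSUMED constant names, shown for illustration only; the library
  // exports constants pointing at files matched to its bundled runtime.
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2,
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER,
} from 'react-native-executorch';

// Wrapping the library hook: the constants pin model and tokenizer to the
// runtime version the library ships, so no manual version check is needed.
export function useDistiluseEmbeddings() {
  return useTextEmbeddings({
    modelSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2,
    tokenizerSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER,
  });
}
```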

Variant Matrix

| Delegate | Precision | File | Size | RMSE vs eager | Notes |
|----------|-----------|------|------|---------------|-------|
| XNNPACK | fp32 | xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_fp32.pte | 516 MB | 0.0 | Baseline. Works on Android / iOS / generic CPU. |
| XNNPACK | 8da4w | xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte | 375 MB | 5.4e-4 | Int8 dynamic activations + int4 weights (torchao), group_size=32. Embeddings stay fp32; the bulk of the size reduction comes from linear layers. |
| CoreML | fp32 | coreml/distiluse-base-multilingual-cased-v2_coreml_fp32.pte | 516 MB | 0.0 | Apple Neural Engine / GPU / CPU, float32 compute. |
| CoreML | fp16 | coreml/distiluse-base-multilingual-cased-v2_coreml_fp16.pte | 258 MB | 1.9e-4 | Half the size via compute_precision=FLOAT16 at CoreML compile time. Cleanest size win on iOS. |

Pick the variant that matches your platform and your size/quality trade-off. The CoreML variants load only on Apple platforms; the XNNPACK variants load everywhere.
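
For example, a runtime platform check can pick the delegate. This is a sketch; the URLs assume the standard Hugging Face resolve layout for this repo.

```typescript
import { Platform } from 'react-native';

const REPO =
  'https://huggingface.co/software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2/resolve/main';

// CoreML fp16 gives the cleanest size win on iOS; the quantized XNNPACK
// variant is the smallest option that loads on every other platform.
export const modelSource =
  Platform.OS === 'ios'
    ? `${REPO}/coreml/distiluse-base-multilingual-cased-v2_coreml_fp16.pte`
    : `${REPO}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte`;
```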

Repository Structure

  • xnnpack/: .pte files partitioned for the XNNPACK delegate.
  • coreml/: .pte files partitioned for the CoreML delegate (iOS / macOS only).
  • tokenizer.json: HuggingFace fast-tokenizer dump (WordPiece + BertNormalizer). Wire this to tokenizerSource.
  • config.json, tokenizer_config.json: upstream model/tokenizer configs, kept for reference and for non-RNE consumers.

The .pte path goes to modelSource; tokenizer.json is shared across all variants.
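
Wired up, that looks roughly like the sketch below; useTextEmbeddings is assumed to be the relevant hook, while modelSource and tokenizerSource are the props named above.

```typescript
import { useTextEmbeddings } from 'react-native-executorch';

// Standard Hugging Face resolve URL for this repo (layout as described above).
const REPO =
  'https://huggingface.co/software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2/resolve/main';

export function useDistiluseFromHub() {
  return useTextEmbeddings({
    // Any of the four .pte variants can go here.
    modelSource: `${REPO}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_fp32.pte`,
    // The single tokenizer.json serves every variant.
    tokenizerSource: `${REPO}/tokenizer.json`,
  });
}
```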

Model Details

  • Architecture: DistilBERT multilingual cased + mean pooling + Dense (768 → 512, Tanh) + L2 norm.
  • Output dimension: 512.
  • Max sequence length: 126 tokens (128 − 2 for [CLS] / [SEP]).
  • Languages: 50+ (multilingual).
  • Typical strength: cross-lingual sentence similarity and medium-length sentence retrieval. Short single-word queries in non-English languages are this model's weakest case; for those, longer sentences and/or English inputs give markedly better ranking. See the similarity sketch below.
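
Because the pipeline ends with L2 normalization, cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch; the forward() call returning a number array is an assumption about the embeddings API.

```typescript
// Dot product of two L2-normalized 512-dim embeddings equals their
// cosine similarity; no extra normalization step is needed.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
  }
  return dot; // ranges over [-1, 1]; higher means more similar
}

// Hypothetical usage with the hook from the snippets above:
// const en = await embeddings.forward('A cat sits on the mat.');
// const de = await embeddings.forward('Eine Katze sitzt auf der Matte.');
// console.log(cosineSimilarity(en, de)); // cross-lingual similarity score
```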

Export Notes

The exported program skips HuggingFace's internal attention-mask-to-4D conversion because the RNE runtime never pads at inference (single sentence, no batching). This preserves bit-exactness with the PyTorch reference (RMSE 0 on fp32 random input) while trimming ~27% off the XNNPACK forward wall-time and keeping XNNPACK delegation around 89–91% of graph runtime.

Unsupported combinations (rejected by the exporter, documented for reference):

  • XNNPACK + fp16: model.to(torch.float16) causes softmax / LayerNorm overflow and the runtime output is NaN. XNNPACK's size wins come from quantization, not fp16.
  • CoreML + 8da4w: coremltools has no MIL mapping for the torch.int8 tensors torchao emits (KeyError: torch.int8). The CoreML-native way to shrink further is ct.optimize.coreml palette/linear quantization, not torchao source transforms.