---
license: apache-2.0
tags:
- onnx
- int8
- quantized
- sentence-similarity
- embeddings
- justembed
base_model: sentence-transformers/all-mpnet-base-v2
library_name: onnxruntime
pipeline_tag: feature-extraction
---

# MPNet INT8: ONNX Quantized

An INT8-quantized ONNX version of [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) for efficient general-purpose sentence embeddings.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) |
| Format | ONNX |
| Quantization | INT8 (dynamic quantization) |
| Embedding Dimension | 768 |
| Quantized by | [JustEmbed](https://pypi.org/project/justembed/) |

## What is this?

This is a quantized ONNX export of all-mpnet-base-v2, a widely used general-purpose sentence embedding model from the sentence-transformers library. It maps sentences and paragraphs to a 768-dimensional dense vector space. INT8 quantization reduces model size and improves inference speed, at the cost of a small approximation error relative to the FP32 weights.
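
Embeddings in that 768-dimensional space are usually compared with cosine similarity. The following is a minimal sketch of that comparison, not part of this repository: the `cosine_similarity` helper and the random placeholder vectors are assumptions for illustration, standing in for real embeddings produced by the model.

```python
import math
import random

EMBEDDING_DIM = 768  # embedding dimension stated in the model details


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors (hypothetical); real ones would come from the model.
random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
v = [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]

print(cosine_similarity(u, u))  # a vector compared with itself
print(cosine_similarity(u, v))  # two unrelated random vectors
```

Scores lie in the range [-1, 1]; for sentence embeddings, higher values indicate closer meaning.
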

## Use Cases

- Semantic text search
- Sentence similarity
- Clustering and topic modeling
- Paraphrase detection
- General-purpose text embeddings

## Files

- `model_quantized.onnx`: INT8 quantized ONNX model
- `tokenizer.json`: Fast tokenizer
- `vocab.txt`: Vocabulary file
- `config.json`: Model configuration

## Usage with JustEmbed

```python
from justembed import Embedder

embedder = Embedder("mpnet-int8")
vectors = embedder.embed(["This is a sentence", "This is another sentence"])
```

## Usage with ONNX Runtime

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the tokenizer files from the current directory (a local copy of this repo).
tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("model_quantized.onnx")

# Tokenize to NumPy arrays and run the model; passing None requests all outputs.
inputs = tokenizer("This is a sentence", return_tensors="np")
outputs = session.run(None, dict(inputs))
```

## Quantization Details

- Method: Dynamic INT8 quantization via ONNX Runtime
- Source: Original PyTorch weights converted to ONNX, then quantized
- Speed: roughly 2-3x faster inference than FP32 (hardware dependent)
- Size: roughly 4x smaller than FP32

## License

This model is a derivative work of [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). The original model is licensed under **Apache License 2.0**, and this quantized version is distributed under the same license. See the [LICENSE](LICENSE) file for the full text.

## Citation

```bibtex
@inproceedings{song2020mpnet,
  title={MPNet: Masked and Permuted Pre-training for Language Understanding},
  author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
  booktitle={NeurIPS},
  year={2020}
}
```

## Acknowledgments

- Original model by [UKP Lab / sentence-transformers](https://www.sbert.net/)
- Quantization and packaging by [JustEmbed](https://pypi.org/project/justembed/)