---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---

# Quantized SentenceTransformer: all-MiniLM-L6-v2

This is an INT8-quantized ONNX version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, intended for production deployments where a smaller file and faster inference matter.

## Model Details

- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Quantization**: INT8 dynamic quantization using ONNX Runtime
- **Size Reduction**: ~75% smaller than the original model
- **Performance**: Cosine similarity of 0.95 or higher between the quantized embeddings and the original model's embeddings
- **Format**: ONNX

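
The size figure above follows from storing weights as 8-bit integers instead of 32-bit floats. The sketch below illustrates the idea with symmetric per-tensor quantization in plain NumPy. It is an illustration under assumptions, not ONNX Runtime's implementation: the helper names `quantize_int8` and `dequantize` are hypothetical, the weight matrix is random toy data, and ONNX Runtime's own scheme differs in details such as zero points and how activations are quantized at run time.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: weights ~= scale * q (hypothetical helper)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32 weights."""
    return q.astype(np.float32) * scale

# Toy FP32 weight matrix (random data, not taken from the model)
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(384, 384)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# One byte per weight instead of four
size_reduction = 1.0 - q.nbytes / w.nbytes
# Rounding error per weight is bounded by half a quantization step
max_error = float(np.max(np.abs(w - w_hat)))
print(f"size reduction: {size_reduction:.0%}, max error: {max_error:.6f}, step: {scale:.6f}")
```

In a real model only the weights of quantized operators shrink, so the overall file size reduction can be somewhat less than the per-tensor figure computed here.
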
## Files

- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model

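
For readers who want to produce a quantized file themselves, the following is a minimal sketch using ONNX Runtime's `quantize_dynamic` function. It assumes, without confirmation from this card, that `model-quant.onnx` can be derived from `model.onnx` with default settings and INT8 weights; the wrapper name `quantize_model` is hypothetical, and the exact options used for the published file are not documented here.

```python
from pathlib import Path

def quantize_model(fp32_path="model.onnx", int8_path="model-quant.onnx"):
    """Write an INT8 dynamically quantized copy of an FP32 ONNX model.

    Hypothetical wrapper; returns the on-disk size ratio (INT8 / FP32).
    """
    # Imported lazily so this snippet loads even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QInt8,  # assumption: signed 8-bit weights
    )
    return Path(int8_path).stat().st_size / Path(fp32_path).stat().st_size
```

Calling `quantize_model()` in a directory that contains `model.onnx` writes the quantized file next to it and returns the size ratio, which can be compared against the ~75% reduction stated above.
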
## Usage

### With ONNX Runtime (Recommended)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model and the tokenizer of the base model
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Run inference. If the exported graph also declares a "token_type_ids"
    # input, pass inputs["token_type_ids"] here as well.
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    })

    # Apply mean pooling over the non-padding tokens
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1).astype(np.float32)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)

    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)

    # Return the embedding of the single input text
    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```

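
The pooling step in the example above handles one text at a time. When several texts are encoded together, shorter ones are padded, and the attention mask is what keeps padding out of the average. The sketch below isolates that step as a hypothetical helper, `mean_pool`, and checks it on toy arrays rather than real model output. The optional L2 normalization is an assumption about matching the SentenceTransformers pipeline of the base model; it does not change cosine similarities.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask, normalize=True):
    """Masked mean pooling over the sequence axis (hypothetical helper)."""
    mask = attention_mask[..., None].astype(np.float32)   # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)       # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1e-9)           # (batch, 1)
    pooled = summed / counts
    if normalize:
        norms = np.linalg.norm(pooled, axis=1, keepdims=True)
        pooled = pooled / np.maximum(norms, 1e-12)
    return pooled

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The last position of the first sequence is padding and must be ignored.
hidden = np.array(
    [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
     [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]],
    dtype=np.float32,
)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(hidden, mask, normalize=False)
unit = mean_pool(hidden, mask, normalize=True)
print(pooled)
```

With real model output, `last_hidden_state` would be `outputs[0]` and `attention_mask` would be `inputs["attention_mask"]` from a tokenizer call on a list of texts.
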
### With SentenceTransformers (Original)

For comparison with the original model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```

## Performance Comparison

| Model | Size | Inference speed (relative to original) | Memory usage (relative to original) | Cosine similarity to original |
|-------|------|----------------------------------------|-------------------------------------|-------------------------------|
| Original (FP32) | ~90 MB | 1.0x | 1.0x | 100% |
| Quantized (INT8) | ~23 MB | 1.2-1.5x | 0.6x | 95%+ |

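
The similarity column can be checked by encoding the same texts with both models and comparing the paired embeddings. The sketch below shows one way to compute that metric; the helper name `rowwise_cosine` is hypothetical, and the arrays are synthetic stand-ins (random vectors plus small noise), not measurements from either model.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between corresponding rows of a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, 1e-12)

# Synthetic stand-ins: in practice, `original` would hold embeddings from the
# FP32 model and `quantized` those from the INT8 model for the same texts.
rng = np.random.default_rng(0)
original = rng.normal(size=(8, 384))
quantized = original + rng.normal(scale=0.05, size=original.shape)

scores = rowwise_cosine(original, quantized)
mean_score = float(scores.mean())
min_score = float(scores.min())
print(f"mean cosine: {mean_score:.4f}, min cosine: {min_score:.4f}")
```

Reporting the minimum alongside the mean is a design choice: a high average can hide individual texts whose embeddings drift more after quantization.
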
## Use Cases

- **Text Clustering**: Group similar texts together
- **Semantic Search**: Find semantically similar documents
- **Recommendation Systems**: Content-based recommendations
- **Duplicate Detection**: Find near-duplicate texts

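
Most of the use cases above reduce to cosine similarity between embeddings. The following is a minimal sketch of semantic search and duplicate detection over a small matrix of vectors. The function names and the `0.95` duplicate threshold are hypothetical choices for illustration, and the three-dimensional toy vectors stand in for real 384-dimensional embeddings.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def top_k(query, corpus, k=3):
    """Return (index, similarity) pairs for the k corpus rows closest to the query."""
    q = normalize_rows(np.asarray(query)[None, :])[0]
    sims = normalize_rows(corpus) @ q
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

def near_duplicates(corpus, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    unit = normalize_rows(corpus)
    sims = unit @ unit.T
    i, j = np.where(np.triu(sims, k=1) >= threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy "embeddings": rows 0 and 1 point in nearly the same direction
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.9, 0.1, 0.0])

hits = top_k(query, corpus, k=2)
dupes = near_duplicates(corpus)
print(hits, dupes)
```

A brute-force matrix product like this is adequate for small collections; larger corpora usually call for an approximate nearest-neighbor index instead.
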
## Technical Details

- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Quantization Method**: Dynamic INT8 quantization
- **Framework**: ONNX Runtime

## Citation

If you use this model, please cite the original work:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```