ONNX export of sentence-transformers/all-MiniLM-L6-v2, built for Ziv, a lightweight local code intelligence tool.
| | Standard | This model |
|---|---|---|
| Runtime | PyTorch + sentence-transformers | onnxruntime only |
| Install size | ~800MB | ~92MB |
| Inference | Same | Same |
No PyTorch. No sentence-transformers. Just onnxruntime.
ziv init --model fast
ziv start
pip install onnxruntime tokenizers huggingface_hub hf_transfer numpy
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ziv-ai/embedder-fast-onnx",
    repo_type="model",
    local_dir=".ziv/models/embedder-fast-onnx",
    local_dir_use_symlinks=False,
)
import os
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer
model_dir = ".ziv/models/embedder-fast-onnx"
tokenizer = Tokenizer.from_file(os.path.join(model_dir, "tokenizer.json"))
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_token="[PAD]", length=None)
session = ort.InferenceSession(
    os.path.join(model_dir, "model.onnx"),
    providers=["CPUExecutionProvider"],
)
texts = ["How does authentication work?", "Explain the login flow"]
encoded = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encoded], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encoded], dtype=np.int64)
token_type_ids = np.array([e.type_ids for e in encoded], dtype=np.int64)
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
# Mean pooling
token_embeddings = outputs[0]
mask = attention_mask[..., None].astype(np.float32)
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clip(min=1e-9)
# L2 normalize
norms = np.linalg.norm(embeddings, axis=1, keepdims=True).clip(min=1e-12)
embeddings = embeddings / norms
print(embeddings.shape) # (2, 384)
print(embeddings @ embeddings.T) # cosine similarity matrix (2x2)
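The pooling and normalization steps above can be checked without downloading the model. The following is a minimal sketch on a hand-made toy batch; the function name `mean_pool_and_normalize` and the toy arrays are illustrative assumptions, not part of this repository. It shows that padded positions (mask value 0) do not affect the pooled vector.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean pooling followed by L2 normalization (hypothetical helper)."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dim.
    mask = attention_mask[..., None].astype(np.float32)
    # Sum only the real tokens, then divide by the number of real tokens.
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)
    pooled = summed / counts
    # Scale each row to unit length; the clip guards against division by zero.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True).clip(min=1e-12)
    return pooled / norms

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up values).
toy_tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],  # third position is padding
    [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]],
], dtype=np.float32)
toy_mask = np.array([[1, 1, 0], [1, 1, 1]], dtype=np.int64)

toy_embeddings = mean_pool_and_normalize(toy_tokens, toy_mask)
print(toy_embeddings)  # rows are unit-length; the padded token is ignored
```

With the real model, `token_embeddings` would be `outputs[0]` and `attention_mask` the array built from the tokenizer output.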
| Property | Value |
|---|---|
| Base model | sentence-transformers/all-MiniLM-L6-v2 |
| Embedding dimensions | 384 |
| Max sequence length | 256 |
| Pooling strategy | Mean pooling |
| Normalization | L2 |
| ONNX opset | 18 |
| Model size | ~92MB |
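Because the embeddings are L2-normalized, a plain dot product equals cosine similarity, so ranking a corpus against a query reduces to one matrix-vector product. The sketch below assumes precomputed, normalized embeddings; `top_k` and the toy vectors are hypothetical and stand in for real 384-dimensional outputs.

```python
import numpy as np

def top_k(query_embedding, corpus_embeddings, k=3):
    """Return (index, score) pairs for the k most similar corpus rows.

    Assumes every vector is already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = corpus_embeddings @ query_embedding
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors in 2 dimensions (made-up values).
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query = np.array([0.8, 0.6], dtype=np.float32)

results = top_k(query, corpus, k=2)
print(results)
```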
| File | Description |
|---|---|
| `model.onnx` | ONNX model weights and graph |
| `tokenizer.json` | Tokenizer vocabulary and rules |
| `tokenizer_config.json` | Tokenizer settings |
| `1_Pooling/config.json` | Pooling strategy config |
| `2_Normalize/` | Signals L2 normalization is applied |
| `config.json` | Model architecture config |
| `modules.json` | Sentence Transformers pipeline order |
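If you reimplement the pipeline yourself, it can help to confirm the pooling mode from `1_Pooling/config.json` rather than hard-coding it. The sketch below parses a sample config; the JSON shown is an assumption based on the keys sentence-transformers typically writes (`pooling_mode_*`, `word_embedding_dimension`), not a copy of the file in this repository, and `active_pooling_modes` is a hypothetical helper.

```python
import json

# Assumed example of a sentence-transformers pooling config (not copied from this repo).
SAMPLE_POOLING_CONFIG = """{
  "word_embedding_dimension": 384,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false
}"""

def active_pooling_modes(config):
    """List the enabled pooling modes, with the 'pooling_mode_' prefix removed."""
    prefix = "pooling_mode_"
    return sorted(
        key[len(prefix):]
        for key, value in config.items()
        if key.startswith(prefix) and value is True
    )

pooling_config = json.loads(SAMPLE_POOLING_CONFIG)
modes = active_pooling_modes(pooling_config)
print(modes, pooling_config["word_embedding_dimension"])
```

To check the downloaded file instead, replace the sample string with the contents of `1_Pooling/config.json` from the model directory.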
This model is released under the Apache 2.0 License, the same license as the original sentence-transformers/all-MiniLM-L6-v2 model.
Original model by Microsoft and the sentence-transformers team.
If you use this in your work, please cite the original model:
@article{wang2020minilm,
title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
author={Wang, Wenhui and Wei, Furu and Dong, Li and Bao, Hangbo and Yang, Nan and Zhou, Ming},
journal={arXiv preprint arXiv:2002.10957},
year={2020}
}