# potion-mxbai-512d
A static embedding model that outperforms potion-base-32M (the previous best static embedding model) by using a stronger teacher model and matching its architecture. Trained on 1M C4 sentences.

Update: A newer version trained on 2M sentences is available: potion-mxbai-2m-512d, scoring 71.28 avg (+0.55 over this model, +2.23 STS).
## Highlights
- 70.73 avg on MTEB English (STS + Classification + PairClassification) vs potion-base-32M's 69.96
- +3.62 STS points over potion-base-32M (69.36 vs 65.74)
- 500x faster than transformer-based embedding models on CPU
- ~32MB model size (63K vocab x 512 dims, float16)
- Pure numpy inference, no GPU needed
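To illustrate why pure-numpy inference is possible, here is a minimal sketch of how a static embedding model encodes a sentence: look up each token's vector in a fixed table, then mean-pool. The toy vocabulary, random weights, and whitespace tokenizer below are illustrative stand-ins, not the real model's.

```python
import numpy as np

# Toy stand-ins: the real model has a ~63K-token vocabulary and trained weights.
vocab = {"static": 0, "embeddings": 1, "are": 2, "fast": 3}
table = np.random.default_rng(0).normal(size=(len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    # Tokenize (naively, by whitespace), look up vectors, mean-pool.
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    return table[ids].astype(np.float32).mean(axis=0)

emb = encode("Static embeddings are fast")
print(emb.shape)  # (512,)
```

The entire forward pass is one table lookup and one mean, which is why it runs orders of magnitude faster than a transformer on CPU.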
## How It Was Made
- Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture, MTEB 64.68)
- Custom vocabulary: 56K tokens built from 1M C4 English sentences via corpus frequency analysis
- Distillation: model2vec distillation with 512-dim PCA
- Tokenlearn pre-training: Contrastive loss training on 1M C4 sentences using tokenlearn
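The 512-dim PCA step in the distillation can be sketched in plain numpy. The teacher vectors below are random stand-ins for mxbai-embed-large-v1's 1024-dim per-token embeddings; the real pipeline performs this reduction inside model2vec's distillation, so treat this as the underlying math only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for teacher token embeddings: one 1024-dim vector per vocab token.
teacher_tok = rng.normal(size=(1000, 1024))

# PCA: center, find principal axes via SVD, keep the top 512 components.
centered = teacher_tok - teacher_tok.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
student_tok = centered @ vt[:512].T
print(student_tok.shape)  # (1000, 512)
```

The projected table is what gets stored and looked up at inference time, which is how a 1024-dim teacher yields a 512-dim student.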
## Benchmark Results
| Model | STS | Classification | PairClassification | Avg |
|---|---|---|---|---|
| potion-mxbai-512d (this) | 69.36 | 65.52 | 77.32 | 70.73 |
| potion-base-32M | 65.74 | 65.96 | 78.17 | 69.96 |
### Per-task breakdown vs potion-base-32M
| Task | This model | potion-base-32M | Diff |
|---|---|---|---|
| STS22 | 65.79 | 36.69 | +29.09 |
| STS17 | 34.21 | 29.85 | +4.36 |
| EmotionClassification | 51.59 | 48.29 | +3.30 |
| STS12 | 65.37 | 62.72 | +2.64 |
| ImdbClassification | 71.73 | 70.13 | +1.61 |
| TweetSentimentExtraction | 57.69 | 56.58 | +1.11 |
| STS15 | 81.62 | 80.76 | +0.86 |
| BIOSSES | 78.27 | 77.56 | +0.72 |
| STS13 | 77.84 | 77.59 | +0.24 |
| SICK-R | 65.78 | 65.67 | +0.12 |
| SprintDuplicateQuestions | 92.60 | 92.55 | +0.04 |
Wins on 11/25 tasks.
## Usage
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
### With Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
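Either way, the returned embeddings are plain arrays, so downstream similarity is a few numpy lines. The vectors below are random stand-ins for `model.encode()` output, to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model.encode() output: two 512-dim sentence embeddings.
embeddings = rng.normal(size=(2, 512))

# Cosine similarity: L2-normalize rows, then take dot products.
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = norm @ norm.T
print(sim[0, 1])
```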
## Training Details
- Compute: Modal A10G GPU
- Featurization: ~2 hours (1M C4 sentences through mxbai-embed-large-v1)
- Training: ~24 minutes (7 epochs, contrastive loss, early stopping)
- Total cost: ~$5-6 on Modal
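The contrastive loss mentioned above can be sketched as an in-batch InfoNCE-style objective, where each sentence's teacher embedding is the positive and the rest of the batch act as negatives. The temperature value and exact formulation here are assumptions; tokenlearn's implementation may differ.

```python
import numpy as np

def info_nce(student: np.ndarray, teacher: np.ndarray, temp: float = 0.07) -> float:
    # Normalize both sides so the dot product is cosine similarity.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = (s @ t.T) / temp                    # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries pair each student row with its own teacher row (positives).
    return float(-log_probs.diagonal().mean())

rng = np.random.default_rng(0)
s, t = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(info_nce(s, t))
```

The loss drops as student rows align with their matching teacher rows: `info_nce(s, s)` is far lower than the random-pair loss above.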
## Citation
```bibtex
@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec}
}
```