potion-mxbai-512d

A static embedding model that outperforms potion-base-32M (the previous best static embedding model) by using a stronger teacher model and matching its architecture. Trained on 1M C4 sentences.

Update: A newer version trained on 2M sentences is available: potion-mxbai-2m-512d, scoring 71.28 avg (+0.55 over this model, +2.23 STS).

Highlights

  • 70.73 avg on MTEB English (STS + Classification + PairClassification) vs potion-base-32M's 69.96
  • +3.62 STS points over potion-base-32M (69.36 vs 65.74)
  • 500x faster than transformer-based embedding models on CPU
  • ~32MB model size (63K vocab x 512 dims, float16)
  • Pure numpy inference, no GPU needed
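
The "pure numpy inference" point can be illustrated with a minimal sketch: a static model stores one precomputed vector per token, so encoding is just a table lookup plus mean pooling. The vocabulary and weights below are toy stand-ins, not the model's actual data.

```python
import numpy as np

# Toy illustration (hypothetical vocabulary and random weights): a static
# model keeps a precomputed vector per token, so "inference" is a table
# lookup followed by mean pooling -- no matrix multiplies, no GPU.
rng = np.random.default_rng(0)
vocab = {"hello": 0, "world": 1, "static": 2, "embeddings": 3}
embedding_table = rng.standard_normal((len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    # Mean-pool the token vectors (upcast to float32 for stability).
    return embedding_table[ids].astype(np.float32).mean(axis=0)

vec = encode("Hello world")
print(vec.shape)  # (512,)
```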

How It Was Made

  1. Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture, MTEB 64.68)
  2. Custom vocabulary: 56K tokens built from 1M C4 English sentences via corpus frequency analysis
  3. Distillation: model2vec distillation with 512-dim PCA
  4. Tokenlearn pre-training: Contrastive loss training on 1M C4 sentences using tokenlearn
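
The PCA step in the distillation above can be sketched as follows. In the real pipeline, each vocabulary token is embedded with mxbai-embed-large-v1 (1024-dim outputs); here a random matrix stands in for those teacher embeddings so the example is self-contained, and PCA is computed directly via numpy's SVD.

```python
import numpy as np

# Stand-in for teacher token embeddings: the real step runs the 56K-token
# vocabulary through mxbai-embed-large-v1; random data is used here so
# the sketch runs offline.
rng = np.random.default_rng(0)
vocab_size, teacher_dim, target_dim = 1000, 1024, 512
teacher = rng.standard_normal((vocab_size, teacher_dim))

# PCA via SVD: center the data, decompose, project onto the top 512
# principal components. The resulting table is the static model's weights.
centered = teacher - teacher.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_table = centered @ vt[:target_dim].T
print(static_table.shape)  # (1000, 512)
```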

Benchmark Results

| Model | STS | Classification | PairClassification | Avg |
|---|---|---|---|---|
| potion-mxbai-512d (this) | 69.36 | 65.52 | 77.32 | 70.73 |
| potion-base-32M | 65.74 | 65.96 | 78.17 | 69.96 |

Per-task breakdown vs potion-base-32M

| Task | Ours | potion-base-32M | Diff |
|---|---|---|---|
| STS22 | 65.79 | 36.69 | +29.09 |
| STS17 | 34.21 | 29.85 | +4.36 |
| EmotionClassification | 51.59 | 48.29 | +3.30 |
| STS12 | 65.37 | 62.72 | +2.64 |
| ImdbClassification | 71.73 | 70.13 | +1.61 |
| TweetSentimentExtraction | 57.69 | 56.58 | +1.11 |
| STS15 | 81.62 | 80.76 | +0.86 |
| BIOSSES | 78.27 | 77.56 | +0.72 |
| STS13 | 77.84 | 77.59 | +0.24 |
| SICK-R | 65.78 | 65.67 | +0.12 |
| SprintDuplicateQuestions | 92.60 | 92.55 | +0.04 |

Wins on 11/25 tasks.

Usage

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```

With Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
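
Once encoded, sentences are typically compared with cosine similarity. The snippet below uses random stand-in vectors so it runs without downloading the model; with the real model, `a` and `b` would be rows of `model.encode([...])`.

```python
import numpy as np

# Stand-in embeddings (random); in practice these come from model.encode().
rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity: dot product of the L2-normalized vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = cosine(a, b)
print(round(score, 4))
```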

Training Details

  • Compute: Modal A10G GPU
  • Featurization: ~2 hours (1M C4 sentences through mxbai-embed-large-v1)
  • Training: ~24 minutes (7 epochs, contrastive loss, early stopping)
  • Total cost: ~$5-6 on Modal

Citation

@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec}
}
