potion-mxbai-512d

A static embedding model that outperforms potion-base-32M (the previous best static embedding model) by using a stronger teacher model and matching its architecture. Trained on 1M C4 sentences.

Update: A newer version trained on 2M sentences is available: potion-mxbai-2m-512d, scoring 71.28 avg (+0.55 over this model, +2.23 STS).

Highlights

  • 70.73 avg on MTEB English (STS + Classification + PairClassification) vs potion-base-32M's 69.96
  • +3.62 STS points over potion-base-32M (69.36 vs 65.74)
  • 500x faster than transformer-based embedding models on CPU
  • ~32MB model size (63K vocab x 512 dims, float16)
  • Pure numpy inference, no GPU needed
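
The "pure numpy inference" point can be illustrated with a minimal sketch: a static model stores one precomputed vector per token, so encoding is just a table lookup plus mean pooling. The vocabulary and weights below are toy stand-ins, not the model's actual data.

```python
import numpy as np

# Toy illustration (hypothetical vocabulary and random weights): a static
# model keeps a precomputed vector per token, so "inference" is a table
# lookup followed by mean pooling -- no matrix multiplies, no GPU.
rng = np.random.default_rng(0)
vocab = {"hello": 0, "world": 1, "static": 2, "embeddings": 3}
embedding_table = rng.standard_normal((len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    # Mean-pool the token vectors (upcast to float32 for stability).
    return embedding_table[ids].astype(np.float32).mean(axis=0)

vec = encode("Hello world")
print(vec.shape)  # (512,)
```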

How It Was Made

  1. Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture, MTEB 64.68)
  2. Custom vocabulary: 56K tokens built from 1M C4 English sentences via corpus frequency analysis
  3. Distillation: model2vec distillation with 512-dim PCA
  4. Tokenlearn pre-training: Contrastive loss training on 1M C4 sentences using tokenlearn
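
The PCA step in the distillation above can be sketched as follows. In the real pipeline, each vocabulary token is embedded with mxbai-embed-large-v1 (1024-dim outputs); here a random matrix stands in for those teacher embeddings so the example is self-contained, and PCA is computed directly via numpy's SVD.

```python
import numpy as np

# Stand-in for teacher token embeddings: the real step runs the 56K-token
# vocabulary through mxbai-embed-large-v1; random data is used here so
# the sketch runs offline.
rng = np.random.default_rng(0)
vocab_size, teacher_dim, target_dim = 1000, 1024, 512
teacher = rng.standard_normal((vocab_size, teacher_dim))

# PCA via SVD: center the data, decompose, project onto the top 512
# principal components. The resulting table is the static model's weights.
centered = teacher - teacher.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_table = centered @ vt[:target_dim].T
print(static_table.shape)  # (1000, 512)
```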

Benchmark Results

| Model | STS | Classification | PairClassification | Avg |
|---|---|---|---|---|
| potion-mxbai-512d (this) | 69.36 | 65.52 | 77.32 | 70.73 |
| potion-base-32M | 65.74 | 65.96 | 78.17 | 69.96 |

Per-task breakdown vs potion-base-32M

| Task | Ours | potion-base-32M | Diff |
|---|---|---|---|
| STS22 | 65.79 | 36.69 | +29.09 |
| STS17 | 34.21 | 29.85 | +4.36 |
| EmotionClassification | 51.59 | 48.29 | +3.30 |
| STS12 | 65.37 | 62.72 | +2.64 |
| ImdbClassification | 71.73 | 70.13 | +1.61 |
| TweetSentimentExtraction | 57.69 | 56.58 | +1.11 |
| STS15 | 81.62 | 80.76 | +0.86 |
| BIOSSES | 78.27 | 77.56 | +0.72 |
| STS13 | 77.84 | 77.59 | +0.24 |
| SICK-R | 65.78 | 65.67 | +0.12 |
| SprintDuplicateQuestions | 92.60 | 92.55 | +0.04 |

Wins on 11/25 tasks.

Usage

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```

With Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("blobbybob/potion-mxbai-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
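
Once encoded, sentences are typically compared with cosine similarity. The snippet below uses random stand-in vectors so it runs without downloading the model; with the real model, `a` and `b` would be rows of `model.encode([...])`.

```python
import numpy as np

# Stand-in embeddings (random); in practice these come from model.encode().
rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity: dot product of the L2-normalized vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = cosine(a, b)
print(round(score, 4))
```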

Training Details

  • Compute: Modal A10G GPU
  • Featurization: ~2 hours (1M C4 sentences through mxbai-embed-large-v1)
  • Training: ~24 minutes (7 epochs, contrastive loss, early stopping)
  • Total cost: ~$5-6 on Modal

Citation

@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec}
}
