Paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval (arXiv:2407.19669)
A static embedding model distilled from Alibaba-NLP/gte-multilingual-base using Model2Vec.
| Property | Value |
|---|---|
| Dimensions | 768 |
| Vocabulary | ~250,000 tokens |
| Base Model | Alibaba-NLP/gte-multilingual-base |
| Distillation Method | Model2Vec (PCA + SIF weighting) |
| Speed | ~3,000+ texts/second (CPU) |
| Languages | 70+ (inherited from GTE) |
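The distillation method named in the table (Model2Vec's PCA plus SIF-style frequency weighting) can be sketched on toy data. Everything below is illustrative: the random token embeddings, counts, dimensions, the `sif_a` constant, and the ordering of the two steps stand in for the actual Model2Vec pipeline, which operates on the transformer's ~250k-token, 768-dim output space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-token embeddings (1,000 tokens x 64 dims) standing in for the
# transformer's output embeddings; counts stand in for corpus frequencies.
vocab_size, dim, out_dim = 1000, 64, 16
token_embeddings = rng.normal(size=(vocab_size, dim))
token_counts = rng.integers(1, 10_000, size=vocab_size)

# SIF (smooth inverse frequency) weighting: frequent tokens are down-weighted.
# sif_a is a smoothing constant (illustrative value).
freqs = token_counts / token_counts.sum()
sif_a = 1e-3
weights = sif_a / (sif_a + freqs)
weighted = token_embeddings * weights[:, None]

# Dimensionality reduction via PCA (SVD on the centered matrix),
# keeping out_dim principal components.
centered = weighted - weighted.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:out_dim].T  # static embedding table, (vocab, out_dim)

print(reduced.shape)  # (1000, 16)
```

Sentence embeddings from such a table are then just (weighted) averages of token rows, which is why encoding is orders of magnitude faster than a transformer forward pass.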
Scores on selected classification benchmarks:

| Task | Language | Accuracy | F1 |
|---|---|---|---|
| Banking77Classification | EN | 52.7% | 51.3% |
| AmazonReviewsClassification | DE | 28.6% | 27.9% |
Note: these scores are in the typical range for static embedding models; the trade-off is speed (~3,000 texts/s on CPU vs. ~20 texts/s for transformer models).
Comparison with other static embedding models on Banking77:

| Model | Banking77 (EN) |
|---|---|
| GloVe | ~35% |
| FastText | ~40% |
| m2v-gte-multilingual-768 | 52.7% |
| potion-base-8M (official) | ~55% |
Tested on multilabel text classification (German educational content, 44 labels):
| Metric | Score |
|---|---|
| F1 Macro | 82.9% |
| F1 Micro | 88.2% |
| Precision Macro | 90.9% |
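The macro and micro F1 reported above differ in how they aggregate over the 44 labels: macro averages per-label F1 (rare labels count equally), micro pools true/false positive counts first (frequent labels dominate). A minimal numpy sketch on made-up multilabel predictions:

```python
import numpy as np

# Toy multilabel ground truth and predictions: 6 samples x 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])

# Per-label true positives, false positives, false negatives.
tp = (y_true & y_pred).sum(axis=0)
fp = ((1 - y_true) & y_pred).sum(axis=0)
fn = (y_true & (1 - y_pred)).sum(axis=0)

# Macro F1: compute F1 per label, then take the unweighted mean.
per_label_f1 = 2 * tp / (2 * tp + fp + fn)
f1_macro = per_label_f1.mean()

# Micro F1: pool counts across all labels before computing F1.
f1_micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print(f"macro={f1_macro:.3f} micro={f1_micro:.3f}")
```

In practice `sklearn.metrics.f1_score(..., average="macro"/"micro")` computes the same quantities.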
Usage with Model2Vec:

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("JanSchachtschabel/m2v-gte-multilingual-768")
embeddings = model.encode(["Beispieltext auf Deutsch", "Example text in English"])
```
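`encode` returns one vector per input text; a typical next step is cosine similarity between them. The vectors below are small stand-ins for the model's 768-dim output so the snippet runs without downloading the model:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two rows of model.encode(...) output.
emb_de = np.array([0.2, 0.8, 0.1])
emb_en = np.array([0.25, 0.75, 0.05])

print(f"{cosine_sim(emb_de, emb_en):.3f}")  # near 1.0 for similar texts
```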
Usage with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JanSchachtschabel/m2v-gte-multilingual-768")
embeddings = model.encode(["Beispieltext auf Deutsch"])
```
Installation:

```bash
pip install model2vec
# or
pip install sentence-transformers
```
This model is released under the Apache 2.0 License.
If you use this model, please cite:
```bibtex
@article{zhang2024mgte,
  title   = {mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author  = {Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
  journal = {arXiv preprint arXiv:2407.19669},
  year    = {2024}
}

@software{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title  = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year   = {2024},
  url    = {https://github.com/MinishLab/model2vec}
}
```