Paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval (arXiv:2407.19669)
A static embedding model distilled from Alibaba-NLP/gte-multilingual-base using Model2Vec.
| Property | Value |
|---|---|
| Dimensions | 768 |
| Vocabulary | ~250,000 tokens |
| Base Model | Alibaba-NLP/gte-multilingual-base |
| Distillation Method | Model2Vec (PCA + SIF weighting) |
| Speed | ~3,000+ texts/second (CPU) |
| Languages | 70+ (inherited from GTE) |
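The distillation method named in the table (Model2Vec's PCA plus SIF-style frequency weighting) can be sketched on toy data. Everything below is illustrative: the random token embeddings, counts, dimensions, the `sif_a` constant, and the ordering of the two steps stand in for the actual Model2Vec pipeline, which operates on the transformer's ~250k-token, 768-dim output space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-token embeddings (1,000 tokens x 64 dims) standing in for the
# transformer's output embeddings; counts stand in for corpus frequencies.
vocab_size, dim, out_dim = 1000, 64, 16
token_embeddings = rng.normal(size=(vocab_size, dim))
token_counts = rng.integers(1, 10_000, size=vocab_size)

# SIF (smooth inverse frequency) weighting: frequent tokens are down-weighted.
# sif_a is a smoothing constant (illustrative value).
freqs = token_counts / token_counts.sum()
sif_a = 1e-3
weights = sif_a / (sif_a + freqs)
weighted = token_embeddings * weights[:, None]

# Dimensionality reduction via PCA (SVD on the centered matrix),
# keeping out_dim principal components.
centered = weighted - weighted.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:out_dim].T  # static embedding table, (vocab, out_dim)

print(reduced.shape)  # (1000, 16)
```

Sentence embeddings from such a table are then just (weighted) averages of token rows, which is why encoding is orders of magnitude faster than a transformer forward pass.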
Scores on selected classification benchmarks:

| Task | Language | Accuracy | F1 |
|---|---|---|---|
| Banking77Classification | EN | 52.7% | 51.3% |
| AmazonReviewsClassification | DE | 28.6% | 27.9% |
Note: these scores are in the typical range for static embedding models; the trade-off is speed (~3,000 texts/s on CPU vs. ~20 texts/s for transformer models).
Comparison with other static embedding models on Banking77:

| Model | Banking77 (EN) |
|---|---|
| GloVe | ~35% |
| FastText | ~40% |
| m2v-gte-multilingual-768 | 52.7% |
| potion-base-8M (official) | ~55% |
Tested on multilabel text classification (German educational content, 44 labels):
| Metric | Score |
|---|---|
| F1 Macro | 82.9% |
| F1 Micro | 88.2% |
| Precision Macro | 90.9% |
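The macro and micro F1 reported above differ in how they aggregate over the 44 labels: macro averages per-label F1 (rare labels count equally), micro pools true/false positive counts first (frequent labels dominate). A minimal numpy sketch on made-up multilabel predictions:

```python
import numpy as np

# Toy multilabel ground truth and predictions: 6 samples x 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])

# Per-label true positives, false positives, false negatives.
tp = (y_true & y_pred).sum(axis=0)
fp = ((1 - y_true) & y_pred).sum(axis=0)
fn = (y_true & (1 - y_pred)).sum(axis=0)

# Macro F1: compute F1 per label, then take the unweighted mean.
per_label_f1 = 2 * tp / (2 * tp + fp + fn)
f1_macro = per_label_f1.mean()

# Micro F1: pool counts across all labels before computing F1.
f1_micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print(f"macro={f1_macro:.3f} micro={f1_micro:.3f}")
```

In practice `sklearn.metrics.f1_score(..., average="macro"/"micro")` computes the same quantities.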
Usage with Model2Vec:

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("JanSchachtschabel/m2v-gte-multilingual-768")
embeddings = model.encode(["Beispieltext auf Deutsch", "Example text in English"])
```
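`encode` returns one vector per input text; a typical next step is cosine similarity between them. The vectors below are small stand-ins for the model's 768-dim output so the snippet runs without downloading the model:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two rows of model.encode(...) output.
emb_de = np.array([0.2, 0.8, 0.1])
emb_en = np.array([0.25, 0.75, 0.05])

print(f"{cosine_sim(emb_de, emb_en):.3f}")  # near 1.0 for similar texts
```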
Usage with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JanSchachtschabel/m2v-gte-multilingual-768")
embeddings = model.encode(["Beispieltext auf Deutsch"])
```
Installation:

```bash
pip install model2vec
# or
pip install sentence-transformers
```
This model is released under the Apache 2.0 License.
If you use this model, please cite:
```bibtex
@article{zhang2024mgte,
  title   = {mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author  = {Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
  journal = {arXiv preprint arXiv:2407.19669},
  year    = {2024}
}

@software{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title  = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year   = {2024},
  url    = {https://github.com/MinishLab/model2vec}
}
```