Paper: Multilingual E5 Text Embeddings: A Technical Report (arXiv:2402.05672)
How to use danielnoumon/multilingual-e5-large-ai-act-nl with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("danielnoumon/multilingual-e5-large-ai-act-nl")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
Fine-tuned multilingual-e5-large for Dutch/English retrieval on EU AI Act documentation. Supports Matryoshka embeddings (1024, 768, 512, 256, 128, 64 dimensions) for flexible speed/quality tradeoffs.
Evaluated on 340 held-out queries across 85 chunks. All metrics measured with cosine similarity.
NDCG@10 by embedding dimension:
| Dim | Base | Stage 1 | Stage 2 | Delta (Base to S2) |
|---|---|---|---|---|
| 1024 | 0.8612 | 0.9426 | 0.9465 | +0.0853 |
| 768 | 0.8577 | 0.9411 | 0.9445 | +0.0868 |
| 512 | 0.8495 | 0.9379 | 0.9412 | +0.0917 |
| 256 | 0.7848 | 0.9383 | 0.9423 | +0.1575 |
| 128 | 0.7283 | 0.9225 | 0.9277 | +0.1994 |
| 64 | 0.6009 | 0.9011 | 0.9058 | +0.3049 |
Key insight: Matryoshka training flattened the quality curve. At 64 dimensions the fine-tuned model retains 96% of its 1024-dimension NDCG@10 (0.906 vs 0.947), whereas the base model retains only 70% (0.601 vs 0.861).
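The retention figures can be reproduced from the per-dimension table. The sketch below only re-does that arithmetic; the dictionary names and the `retention` helper are illustrative and not part of the model or any library.

```python
# NDCG@10 per embedding dimension, copied from the table above.
ndcg_base = {1024: 0.8612, 768: 0.8577, 512: 0.8495, 256: 0.7848, 128: 0.7283, 64: 0.6009}
ndcg_stage2 = {1024: 0.9465, 768: 0.9445, 512: 0.9412, 256: 0.9423, 128: 0.9277, 64: 0.9058}


def retention(scores, full_dim=1024):
    """Return each dimension's score as a fraction of the full-dimension score."""
    full = scores[full_dim]
    return {dim: value / full for dim, value in scores.items()}


base_retention = retention(ndcg_base)
stage2_retention = retention(ndcg_stage2)

# Print one line per dimension, largest first.
for dim in sorted(ndcg_base, reverse=True):
    print(f"dim={dim:4d}  base={base_retention[dim]:.1%}  stage2={stage2_retention[dim]:.1%}")
```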
Retrieval metrics at the full 1024 dimensions:
| Metric | Base | Stage 2 | Delta |
|---|---|---|---|
| NDCG@10 | 0.8612 | 0.9465 | +0.0853 |
| MRR@10 | 0.8315 | 0.9315 | +0.1000 |
| MAP@100 | 0.8336 | 0.9319 | +0.0983 |
| Accuracy@1 | 0.7618 | 0.8912 | +0.1294 |
| Accuracy@10 | 0.9529 | 0.9912 | +0.0383 |
| Recall@10 | 0.9529 | 0.9912 | +0.0383 |
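For readers unfamiliar with these metrics, here is a minimal sketch of how Accuracy@k, MRR@k and NDCG@k are commonly computed per query and then averaged. It assumes exactly one relevant chunk per query, which is consistent with Accuracy@10 equalling Recall@10 in the table but is not stated in the card. The function names and the toy rankings are hypothetical, and this is not necessarily the evaluator that produced the numbers above.

```python
import math


def accuracy_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mrr_at_k(ranked_ids, relevant_id, k):
    """Reciprocal rank of the relevant chunk within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_id, k):
    """With a single relevant chunk the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical toy run: (ranked chunk ids, id of the single relevant chunk).
runs = [
    (["c1", "c7", "c3"], "c1"),  # relevant chunk at rank 1
    (["c4", "c2", "c9"], "c2"),  # relevant chunk at rank 2
    (["c5", "c6", "c8"], "c0"),  # relevant chunk not retrieved
]

n = len(runs)
accuracy_1 = sum(accuracy_at_k(ranked, rel, 1) for ranked, rel in runs) / n
mrr_10 = sum(mrr_at_k(ranked, rel, 10) for ranked, rel in runs) / n
ndcg_10 = sum(ndcg_at_k(ranked, rel, 10) for ranked, rel in runs) / n
print(accuracy_1, mrr_10, ndcg_10)
```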
```bash
pip install sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("DanielNoumon/multilingual-e5-large-ai-act-nl")

# Encode queries and passages with prefixes
queries = ["query: What are the obligations for high-risk AI systems?"]
passages = [
    "passage: High-risk AI systems must comply with requirements in Chapter III...",
    "passage: The AI Act defines prohibited practices in Article 5..."
]
query_emb = model.encode(queries)
passage_emb = model.encode(passages)

# Compute similarity: one row per query, one column per passage
scores = cos_sim(query_emb, passage_emb)
```
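Once the similarity scores are available, retrieval comes down to sorting passages by score. The following is a small sketch of that step in plain Python; `rank_passages` is a hypothetical helper, and `example_scores` is made-up data standing in for one row of the score matrix (for instance `scores[0].tolist()`).

```python
def rank_passages(scores_row, passages, top_k=5):
    """Return the top_k (passage, score) pairs, highest score first."""
    order = sorted(range(len(scores_row)), key=lambda i: scores_row[i], reverse=True)
    return [(passages[i], scores_row[i]) for i in order[:top_k]]


passages = [
    "passage: High-risk AI systems must comply with requirements in Chapter III...",
    "passage: The AI Act defines prohibited practices in Article 5...",
]

# Hypothetical similarity scores for a single query (not real model output).
example_scores = [0.83, 0.41]

ranked = rank_passages(example_scores, passages, top_k=1)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```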
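```python
# Encode with full 1024 dimensions
embeddings_1024 = model.encode(queries)

# Truncate to 256 dimensions for faster search.
# Truncated vectors are generally no longer unit length: cosine similarity is
# unaffected, but re-normalize them before using a dot-product index.
embeddings_256 = embeddings_1024[:, :256]

# Or specify dimension at encoding time
model.truncate_dim = 256
embeddings_256 = model.encode(queries)
```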
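When truncated embeddings feed a dot-product index, it can help to re-normalize them after slicing. The sketch below illustrates why with toy 4-dimensional vectors standing in for real embeddings; `truncate_and_normalize` and `dot` are hypothetical helpers written in plain Python, not library functions.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy unit-length vectors (stand-ins for full-size embeddings).
query_vec = [0.6, 0.0, 0.8, 0.0]
passage_vec = [0.8, 0.6, 0.0, 0.0]

# Slicing alone shrinks the query vector, so the raw dot product is too small.
raw_dot = dot(query_vec[:2], passage_vec[:2])

# After re-normalizing, the dot product equals the cosine similarity.
q2 = truncate_and_normalize(query_vec, 2)
p2 = truncate_and_normalize(passage_vec, 2)
cosine = dot(q2, p2)
print(raw_dot, cosine)
```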
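Speed vs quality tradeoff: smaller dimensions shrink the index and reduce search cost, at the NDCG@10 cost shown in the per-dimension table above.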
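As a rough guide to the storage side of this tradeoff, the sketch below estimates the size of an uncompressed vector index at each supported dimension. It assumes float32 storage (4 bytes per value) and a hypothetical corpus of one million vectors; neither figure comes from the card, and real indexes add their own overhead.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32

SUPPORTED_DIMS = [1024, 768, 512, 256, 128, 64]


def index_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a flat (uncompressed) index in bytes."""
    return num_vectors * dim * bytes_per_value


num_vectors = 1_000_000  # hypothetical corpus size

sizes = {dim: index_bytes(num_vectors, dim) for dim in SUPPORTED_DIMS}
for dim, size in sizes.items():
    print(f"dim={dim:4d}  {size / 1e6:8.1f} MB")
```

For brute-force search, the cost of each similarity computation also scales linearly with the dimension, so the same ratios apply roughly to compute.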
This model requires query: and passage: prefixes (inherited from multilingual-e5-large):
```python
# Correct
queries = ["query: your question here"]
passages = ["passage: your document here"]

# Wrong (will degrade performance)
queries = ["your question here"]
passages = ["your document here"]
```
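One way to avoid forgetting the prefixes is to add them in a small helper before encoding. This is a sketch only; `with_prefix` and the two constants are hypothetical names, not part of sentence-transformers.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts, prefix):
    """Prepend `prefix` to each text unless it is already present."""
    return [text if text.startswith(prefix) else prefix + text for text in texts]


queries = with_prefix(["your question here"], QUERY_PREFIX)
passages = with_prefix(["passage: your document here", "another document"], PASSAGE_PREFIX)
print(queries)
print(passages)
```

The helper leaves already-prefixed strings untouched, so applying it twice does not duplicate the prefix.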
Loss: MatryoshkaLoss(MultipleNegativesRankingLoss)
License: MIT
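To give an intuition for what a Matryoshka loss wrapped around a multiple-negatives ranking loss optimizes, here is a plain-Python sketch: the in-batch ranking loss is computed on each truncated prefix of the embeddings and the results are summed. This is an illustration under assumptions, not the training code for this model; the function names, the toy vectors, the equal weights and the similarity scale of 20 are all choices made for the example.

```python
import math


def normalize(vector):
    """Rescale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def mnr_loss(queries, passages, scale=20.0):
    """In-batch negatives cross-entropy: passage i is the positive for query i."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, pj)) for pj in p]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


def matryoshka_loss(queries, passages, dims, weights=None):
    """Sum the ranking loss over each truncated prefix of the embeddings."""
    weights = weights or [1.0] * len(dims)
    return sum(
        w * mnr_loss([v[:d] for v in queries], [v[:d] for v in passages])
        for d, w in zip(dims, weights)
    )


# Toy 4-dimensional embeddings (stand-ins for 1024-dimensional ones).
query_embs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # each query matches its passage
swapped = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]   # positives are the wrong passages

loss_aligned = matryoshka_loss(query_embs, aligned, dims=[4, 2])
loss_swapped = matryoshka_loss(query_embs, swapped, dims=[4, 2])
print(loss_aligned, loss_swapped)
```

Because every prefix length contributes to the total, the leading dimensions are pushed to carry most of the ranking signal, which is what makes later truncation cheap in quality terms.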
If you use this model, please cite the base model and training frameworks:
```bibtex
@misc{wang2024multilingual,
    title={Multilingual E5 Text Embeddings: A Technical Report},
    author={Liang Wang and Nan Yang and Xiaolong Huang and Linjun Yang and Rangan Majumder and Furu Wei},
    year={2024},
    eprint={2402.05672},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
Base model: intfloat/multilingual-e5-large