Paper: [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/abs/2402.03216)
This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using AnglELoss (Angle-optimized Embeddings). The model excels at measuring the semantic similarity between Turkish sentence pairs, achieving state-of-the-art performance on the Turkish STS-B benchmark.
Best Model: Epoch 1.0 (Step 45)
| Metric | Score |
|---|---|
| Spearman Correlation | 0.8629 (86.29%) |
| Pearson Correlation | 0.8575 (85.75%) |
| Validation Loss | 5.682 |
Best checkpoint saved at step 45 (epoch 1.0) based on validation loss
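The Spearman and Pearson figures reported above can be reproduced from raw similarity predictions. As a minimal pure-Python sketch (the actual evaluation uses the standard scipy implementations via sentence-transformers), Spearman correlation is simply Pearson correlation computed over ranks:

```python
def pearson(x, y):
    # Pearson correlation: covariance normalized by the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # assign 1-based ranks, averaging over ties
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman = Pearson over the ranked values
    return pearson(ranks(x), ranks(y))

print(spearman([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))  # 1.0 (perfectly monotone)
```

Spearman only rewards getting the *ordering* of similarity scores right, which is why it is the primary STS metric.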
| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|---|---|---|---|---|---|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |
Bold row indicates the best checkpoint selected by early stopping
AnglELoss Advantages:
- Optimizes the angle difference between embeddings in complex space rather than the raw cosine similarity
- Avoids the saturation zones of the cosine function, where gradients vanish for pairs scored near the extremes
- Well suited to STS data, where many pairs are either nearly identical or nearly unrelated
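The saturation problem that AnglE-style losses address can be illustrated numerically (a sketch of ours, not code from the paper): the gradient of the cosine with respect to the angle between two embeddings vanishes exactly where STS pairs concentrate, near 0° (near-duplicates) and near 180°:

```python
import math

def dcos(theta):
    # derivative of cos(theta) w.r.t. the angle between two embeddings
    return -math.sin(theta)

# Near-identical pairs (theta ~ 0) sit in a saturation zone: almost no gradient.
print(abs(dcos(0.01)))  # ~0.01, tiny gradient signal
# The gradient is largest for pairs at 90 degrees.
print(abs(dcos(math.pi / 2)))  # 1.0
```

Optimizing the angle itself instead of its cosine keeps the gradient informative across the whole similarity range.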
| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|---|---|---|---|---|---|---|---|---|---|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |
```
SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_mean_tokens': True,
    'pooling_mode_cls_token': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
```
Each training example consists of:
| Sentence 1 | Sentence 2 | Score |
|---|---|---|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |
This model is specifically optimized for:
- Scoring the semantic similarity of Turkish sentence pairs
- Producing dense 1024-dimensional sentence embeddings for Turkish text
- Semantic search and retrieval over Turkish corpora
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f" - '{sent1}'")
    print(f" - '{sent2}'")
    print()
```
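Under the hood, `util.pytorch_cos_sim` computes plain cosine similarity. A dependency-free sketch of the same computation (toy vectors, not real embeddings):

```python
import math

def cosine(u, v):
    # dot product of the two vectors divided by the product of their norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # 1.0: parallel vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal vectors
```

Because this model ends with a `Normalize()` module, its embeddings are unit-length, so cosine similarity reduces to a simple dot product.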
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)
```
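Since the embeddings are L2-normalized, the resulting similarity matrix is symmetric with ones on the diagonal. A pure-Python sketch with toy 2-D vectors illustrates these properties:

```python
import math

def normalize(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sim_matrix(embs):
    # for unit-norm embeddings, cosine similarity is just the dot product
    embs = [normalize(e) for e in embs]
    return [[sum(a * b for a, b in zip(u, v)) for v in embs] for u in embs]

# third vector is parallel to the first, so their similarity is 1.0
m = sim_matrix([[1.0, 2.0], [2.0, 1.0], [1.0, 2.0]])
for row in m:
    print([round(x, 3) for x in row])
```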
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")
```
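Conceptually, `util.semantic_search` scores every corpus vector against the query by cosine similarity and returns the top-k hits in the same `{'corpus_id': ..., 'score': ...}` shape. A minimal pure-Python sketch of that idea (toy vectors; the real function works on batched tensors):

```python
import math

def top_k_search(query, corpus, k=3):
    # rank corpus vectors by cosine similarity to the query vector
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    scored = [{"corpus_id": i, "score": cos(query, v)} for i, v in enumerate(corpus)]
    return sorted(scored, key=lambda h: h["score"], reverse=True)[:k]

hits = top_k_search([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
print(hits)  # corpus_id 1 (identical direction) ranks first
```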
| Parameter | Value |
|---|---|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
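The batch-size rows in the table are related by simple arithmetic: the physical batch size is the per-device size times the number of GPUs, and the effective batch size additionally multiplies in the gradient accumulation steps:

```python
per_device_batch = 8   # per-device train batch size
num_gpus = 4
grad_accum_steps = 4

# samples processed per forward pass across all devices
physical_batch = per_device_batch * num_gpus
# samples contributing to each optimizer step
effective_batch = physical_batch * grad_accum_steps

print(physical_batch)   # 32
print(effective_batch)  # 128
```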
```bibtex
@inproceedings{li-li-2024-aoe,
  title     = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
  author    = "Li, Xianming and Li, Jing",
  year      = "2024",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2024.acl-long.101/",
  doi       = "10.18653/v1/2024.acl-long.101"
}

@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084"
}

@misc{bge-m3,
  title         = {BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author        = {Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year          = {2024},
  eprint        = {2402.03216},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@misc{stsb-deepl-tr,
  title  = {Turkish STS-B Dataset (DeepL Translation)},
  author = {NewMind AI},
  year   = {2024},
  url    = {https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}
```
This model is licensed under the Apache 2.0 License.
Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)