Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using AnglELoss (angle-optimized embeddings). The model measures semantic similarity between Turkish sentence pairs, achieving strong results on the Turkish STS-B benchmark (86.29% Spearman correlation).

Overview

  • Base Model: BAAI/bge-m3 (1024-dimensional embeddings)
  • Training Task: Semantic Textual Similarity (STS)
  • Framework: Sentence Transformers (v5.1.1)
  • Language: Turkish (multilingual base model)
  • Dataset: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
  • Loss Function: AnglELoss (Angle-optimized with pairwise angle similarity)
  • Training Status: Completed (5 epochs)
  • Best Checkpoint: Epoch 1.0 (Step 45) - Validation Loss: 5.682
  • Final Spearman Correlation: 86.29%
  • Final Pearson Correlation: 85.75%
  • Context Length: 1024 tokens
  • Training Time: ~8 minutes (single task)

Performance Metrics

Final Evaluation Results

Best Model: Epoch 1.0 (Step 45)

| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.8629 (86.29%) |
| Pearson Correlation | 0.8575 (85.75%) |
| Validation Loss | 5.682 |

Best checkpoint saved at step 45 (epoch 1.0) based on validation loss

Training Progression

| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|------|-------|---------------|-----------------|----------|---------|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |

Bold row indicates the best checkpoint, selected by lowest validation loss
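
The selection rule can be reproduced in a few lines (validation losses copied from the table above):

```python
# The saved "best" checkpoint is simply the evaluation step with the
# lowest validation loss among all evaluated checkpoints.
eval_loss = {15: 6.8784, 30: 5.8729, 45: 5.682, 60: 5.7641,
             105: 6.0607, 150: 6.1735, 165: 6.2597, 225: 6.5089}
best_step = min(eval_loss, key=eval_loss.get)
print(best_step, eval_loss[best_step])  # step 45, validation loss 5.682
```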

Training Infrastructure

Hardware Configuration

  • Nodes: 1
  • Node Name: as07r1b16
  • GPUs per Node: 4
  • Total GPUs: 4
  • CPUs: Not specified
  • Node Hours: ~0.13 hours (8 minutes)
  • GPU Type: NVIDIA (MareNostrum 5 ACC Partition)
  • Training Type: Multi-GPU with DataParallel (DP)

Training Statistics

  • Total Training Steps: 225
  • Training Samples: 5,749 (Turkish STS-B pairs)
  • Evaluation Samples: 1,379 (Turkish STS-B pairs)
  • Final Average Loss: 5.463
  • Training Time: ~6.5 minutes (390 seconds)
  • Samples/Second: 73.68
  • Steps/Second: 0.577

Training Configuration

Batch Configuration

  • Per-Device Batch Size: 8 (per GPU)
  • Number of GPUs: 4
  • Physical Batch Size: 32 (4 GPUs × 8 per-device)
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 128 (32 physical × 4 accumulation)
  • Samples per Step: 32
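
The arithmetic behind these numbers, and why the evaluation/save cadence of 45 steps corresponds to exactly one epoch, can be checked directly:

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum = 4

physical_batch = per_device_batch * num_gpus   # 32 samples per forward pass
effective_batch = physical_batch * grad_accum  # 128 samples per optimizer step

train_samples = 5749
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 45 steps
total_steps = steps_per_epoch * 5                             # 225 over 5 epochs
```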

Loss Function

  • Type: AnglELoss (Angle-optimized Embeddings)
  • Implementation: CoSENT-style objective with pairwise angle similarity
  • Scale: 20.0 (temperature parameter)
  • Similarity Function: pairwise_angle_sim
  • Task: Regression (predicting similarity scores 0.0-1.0)

AnglELoss Advantages:

  1. Angle Optimization: Optimizes the angle between embeddings rather than raw cosine similarity
  2. Better Geometric Properties: Encourages uniform distribution on the unit hypersphere
  3. Improved Discrimination: Better separation between similar and dissimilar pairs
  4. Temperature Scaling: Scale parameter (20.0) controls the sharpness of similarity distribution
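
A simplified numerical illustration of point 1 (this shows the motivation behind angle optimization, not the exact `pairwise_angle_sim` computation used by the loss): the cosine objective's gradient with respect to the angle θ vanishes for nearly parallel embeddings, while an angle-based objective does not.

```python
import math

# d/d(theta) cos(theta) = -sin(theta): near theta = 0 (very similar pairs)
# the cosine objective's gradient vanishes, so training barely updates those
# pairs, while optimizing the angle itself keeps a constant gradient of 1.
thetas = [0.01, 0.5, 1.5]                           # angles between pairs (radians)
cosine_grads = [abs(-math.sin(t)) for t in thetas]  # -> ~0.010, 0.479, 0.997
angle_grads = [1.0 for _ in thetas]                 # constant regardless of theta
```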

Optimization

  • Optimizer: AdamW (fused)
  • Base Learning Rate: 5e-05
  • Learning Rate Scheduler: Linear with warmup
  • Warmup Steps: 89
  • Weight Decay: 0.01
  • Max Gradient Norm: 1.0
  • Mixed Precision: Disabled
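
The schedule above (linear with 89 warmup steps over 225 total steps) can be sketched as a plain function; the exact `transformers` scheduler may differ in off-by-one details:

```python
def linear_warmup_lr(step, base_lr=5e-05, warmup=89, total=225):
    """Linear warmup to base_lr over `warmup` steps, then linear decay to 0."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (total - step) / (total - warmup)
```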

Checkpointing & Evaluation

  • Save Strategy: Every 45 steps
  • Evaluation Strategy: Every 15 steps
  • Logging Steps: 10
  • Save Total Limit: 3 checkpoints
  • Best Model Selection: Based on validation loss (lower is better)
  • Load Best Model at End: True

Job Details

| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|-------|---------|---------|-----------|-------|-------|-----|------|------|----------|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |

Model Architecture

SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_mean_tokens': True,
    'pooling_mode_cls_token': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
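
Conceptually, stages (1) and (2) of the pipeline above reduce to mean pooling followed by L2 normalization. A toy stdlib-only sketch (the real Pooling module averages only non-padding tokens via the attention mask):

```python
import math
import random

# Toy token embeddings: 12 tokens, 1024 dimensions (random values).
random.seed(0)
token_embs = [[random.random() for _ in range(1024)] for _ in range(12)]

# (1) Pooling: average each dimension over all tokens.
pooled = [sum(tok[d] for tok in token_embs) / len(token_embs)
          for d in range(1024)]

# (2) Normalize(): scale to unit length, so cosine similarity between two
# sentence embeddings reduces to a plain dot product.
norm = math.sqrt(sum(v * v for v in pooled))
sentence_emb = [v / norm for v in pooled]
```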

Training Dataset

stsb-deepl-tr

  • Dataset: stsb-deepl-tr
  • Training Size: 5,749 sentence pairs
  • Evaluation Size: 1,379 sentence pairs
  • Task: Semantic Textual Similarity (regression)
  • Score Range: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
  • Normalized Range: 0.0 to 1.0 (divided by 5.0 during preprocessing)
  • Average Sentence Length: ~10-15 tokens per sentence

Data Format

Each training example consists of:

  • Sentence 1: Turkish sentence (6-30 tokens)
  • Sentence 2: Turkish sentence (6-26 tokens)
  • Similarity Score: Float value 0.0-1.0 (normalized from 0-5 scale)
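
The normalization described above is a single division during preprocessing:

```python
# Raw STS-B labels on the 0-5 scale are divided by 5.0 to give
# regression targets in [0, 1] for the similarity head.
raw_labels = [0.0, 1.0, 2.5, 5.0]
targets = [s / 5.0 for s in raw_labels]  # [0.0, 0.2, 0.5, 1.0]
```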

Sample Data

| Sentence 1 | Sentence 2 | Score |
|------------|------------|-------|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |

Capabilities

This model is specifically optimized for:

  • Semantic Similarity Scoring: Predicting similarity scores between Turkish sentence pairs
  • Paraphrase Detection: Identifying paraphrases and semantically equivalent sentences
  • Duplicate Detection: Finding duplicate or near-duplicate Turkish content
  • Question-Answer Matching: Matching questions with semantically similar answers
  • Document Similarity: Comparing semantic similarity of Turkish documents
  • Sentence Clustering: Grouping semantically similar Turkish sentences
  • Textual Entailment: Understanding semantic relationships between sentences

Usage

Installation

pip install -U sentence-transformers

Semantic Similarity Scoring

from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)
    
    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()

Batch Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)

Finding Most Similar Sentences

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")

Training Details

Complete Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
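
As a rough sketch, these hyperparameters map onto the sentence-transformers v3+ trainer configuration as follows. This is an assumed reconstruction, not the original training script; field names are inherited from `transformers.TrainingArguments`:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Hypothetical reconstruction from the hyperparameter table above.
args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-stsb-turkish",   # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-05,
    weight_decay=0.01,
    warmup_steps=89,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    num_train_epochs=5,
    eval_strategy="steps",
    eval_steps=15,
    save_steps=45,
    save_total_limit=3,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # lower is better
    greater_is_better=False,
    optim="adamw_torch_fused",
)
```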

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.1.1
  • PyTorch: 2.8.0+cu128
  • Transformers: 4.57.0
  • CUDA: 12.8
  • Accelerate: 1.10.1
  • Datasets: 4.2.0
  • Tokenizers: 0.22.1

Use Cases

  • Chatbot Response Matching: Find the most semantically similar pre-defined response for user queries
  • FAQ Search: Match user questions to the most relevant FAQ entries
  • Content Recommendation: Recommend articles or documents with similar semantic content
  • Plagiarism Detection: Identify semantically similar text for academic integrity checks
  • Customer Support: Match support tickets to similar previously resolved issues
  • Document Clustering: Group documents by semantic similarity for organization
  • Paraphrase Mining: Automatically detect paraphrases in large Turkish text corpora
  • Semantic Search: Build semantic search engines for Turkish content
  • Question Answering: Match questions to semantically relevant answer candidates
  • Text Summarization: Identify redundant sentences for summary generation

Citation

AnglELoss

@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Base Model (BGE-M3)

@misc{bge-m3,
    title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year={2024},
    eprint={2402.03216},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Dataset

@misc{stsb-deepl-tr,
    title={Turkish STS-B Dataset (DeepL Translation)},
    author={NewMind AI},
    year={2024},
    url={https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}

License

This model is licensed under the Apache 2.0 License.

Acknowledgments

  • Base Model: BAAI/bge-m3
  • Training Infrastructure: MareNostrum 5 Supercomputer (Barcelona Supercomputing Center)
  • Framework: Sentence Transformers by UKP Lab
  • Dataset: newmindai/stsb-deepl-tr
  • Loss Function: AnglELoss (Angle-optimized Embeddings)
  • Training Approach: Single-task fine-tuning on Turkish STS-B