Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using AnglELoss (angle-optimized embeddings). The model measures semantic similarity between Turkish sentence pairs, achieving strong results on the Turkish STS-B benchmark (86.29% Spearman correlation).

Overview

  • Base Model: BAAI/bge-m3 (1024-dimensional embeddings)
  • Training Task: Semantic Textual Similarity (STS)
  • Framework: Sentence Transformers (v5.1.1)
  • Language: Turkish (multilingual base model)
  • Dataset: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
  • Loss Function: AnglELoss (Angle-optimized with pairwise angle similarity)
  • Training Status: Completed (5 epochs)
  • Best Checkpoint: Epoch 1.0 (Step 45) - Validation Loss: 5.682
  • Final Spearman Correlation: 86.29%
  • Final Pearson Correlation: 85.75%
  • Context Length: 1024 tokens
  • Training Time: ~8 minutes (single task)

Performance Metrics

Final Evaluation Results

Best Model: Epoch 1.0 (Step 45)

| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.8629 (86.29%) |
| Pearson Correlation | 0.8575 (85.75%) |
| Validation Loss | 5.682 |

Best checkpoint saved at step 45 (epoch 1.0) based on validation loss

Training Progression

| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|------|-------|---------------|-----------------|----------|---------|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |

Bold row indicates the best checkpoint, selected by lowest validation loss
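
The selection rule can be reproduced in a few lines (validation losses copied from the table above):

```python
# The saved "best" checkpoint is simply the evaluation step with the
# lowest validation loss among all evaluated checkpoints.
eval_loss = {15: 6.8784, 30: 5.8729, 45: 5.682, 60: 5.7641,
             105: 6.0607, 150: 6.1735, 165: 6.2597, 225: 6.5089}
best_step = min(eval_loss, key=eval_loss.get)
print(best_step, eval_loss[best_step])  # step 45, validation loss 5.682
```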

Training Infrastructure

Hardware Configuration

  • Nodes: 1
  • Node Name: as07r1b16
  • GPUs per Node: 4
  • Total GPUs: 4
  • CPUs: Not specified
  • Node Hours: ~0.13 hours (8 minutes)
  • GPU Type: NVIDIA (MareNostrum 5 ACC Partition)
  • Training Type: Multi-GPU with DataParallel (DP)

Training Statistics

  • Total Training Steps: 225
  • Training Samples: 5,749 (Turkish STS-B pairs)
  • Evaluation Samples: 1,379 (Turkish STS-B pairs)
  • Final Average Loss: 5.463
  • Training Time: ~6.5 minutes (390 seconds)
  • Samples/Second: 73.68
  • Steps/Second: 0.577

Training Configuration

Batch Configuration

  • Per-Device Batch Size: 8 (per GPU)
  • Number of GPUs: 4
  • Physical Batch Size: 32 (4 GPUs × 8 per-device)
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 128 (32 physical × 4 accumulation)
  • Samples per Step: 32
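
The arithmetic behind these numbers, and why the evaluation/save cadence of 45 steps corresponds to exactly one epoch, can be checked directly:

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum = 4

physical_batch = per_device_batch * num_gpus   # 32 samples per forward pass
effective_batch = physical_batch * grad_accum  # 128 samples per optimizer step

train_samples = 5749
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 45 steps
total_steps = steps_per_epoch * 5                             # 225 over 5 epochs
```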

Loss Function

  • Type: AnglELoss (Angle-optimized Embeddings)
  • Implementation: CoSENT-style objective with pairwise angle similarity
  • Scale: 20.0 (temperature parameter)
  • Similarity Function: pairwise_angle_sim
  • Task: Regression (predicting similarity scores 0.0-1.0)

AnglELoss Advantages:

  1. Angle Optimization: Optimizes the angle between embeddings rather than raw cosine similarity
  2. Better Geometric Properties: Encourages uniform distribution on the unit hypersphere
  3. Improved Discrimination: Better separation between similar and dissimilar pairs
  4. Temperature Scaling: Scale parameter (20.0) controls the sharpness of similarity distribution
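
A simplified numerical illustration of point 1 (this shows the motivation behind angle optimization, not the exact `pairwise_angle_sim` computation used by the loss): the cosine objective's gradient with respect to the angle θ vanishes for nearly parallel embeddings, while an angle-based objective does not.

```python
import math

# d/d(theta) cos(theta) = -sin(theta): near theta = 0 (very similar pairs)
# the cosine objective's gradient vanishes, so training barely updates those
# pairs, while optimizing the angle itself keeps a constant gradient of 1.
thetas = [0.01, 0.5, 1.5]                           # angles between pairs (radians)
cosine_grads = [abs(-math.sin(t)) for t in thetas]  # -> ~0.010, 0.479, 0.997
angle_grads = [1.0 for _ in thetas]                 # constant regardless of theta
```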

Optimization

  • Optimizer: AdamW (fused)
  • Base Learning Rate: 5e-05
  • Learning Rate Scheduler: Linear with warmup
  • Warmup Steps: 89
  • Weight Decay: 0.01
  • Max Gradient Norm: 1.0
  • Mixed Precision: Disabled
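
The schedule above (linear with 89 warmup steps over 225 total steps) can be sketched as a plain function; the exact `transformers` scheduler may differ in off-by-one details:

```python
def linear_warmup_lr(step, base_lr=5e-05, warmup=89, total=225):
    """Linear warmup to base_lr over `warmup` steps, then linear decay to 0."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (total - step) / (total - warmup)
```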

Checkpointing & Evaluation

  • Save Strategy: Every 45 steps
  • Evaluation Strategy: Every 15 steps
  • Logging Steps: 10
  • Save Total Limit: 3 checkpoints
  • Best Model Selection: Based on validation loss (lower is better)
  • Load Best Model at End: True

Job Details

| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|-------|---------|---------|-----------|-------|-------|-----|------|------|----------|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |

Model Architecture

SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_mean_tokens': True,
    'pooling_mode_cls_token': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
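
Conceptually, stages (1) and (2) of the pipeline above reduce to mean pooling followed by L2 normalization. A toy stdlib-only sketch (the real Pooling module averages only non-padding tokens via the attention mask):

```python
import math
import random

# Toy token embeddings: 12 tokens, 1024 dimensions (random values).
random.seed(0)
token_embs = [[random.random() for _ in range(1024)] for _ in range(12)]

# (1) Pooling: average each dimension over all tokens.
pooled = [sum(tok[d] for tok in token_embs) / len(token_embs)
          for d in range(1024)]

# (2) Normalize(): scale to unit length, so cosine similarity between two
# sentence embeddings reduces to a plain dot product.
norm = math.sqrt(sum(v * v for v in pooled))
sentence_emb = [v / norm for v in pooled]
```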

Training Dataset

stsb-deepl-tr

  • Dataset: stsb-deepl-tr
  • Training Size: 5,749 sentence pairs
  • Evaluation Size: 1,379 sentence pairs
  • Task: Semantic Textual Similarity (regression)
  • Score Range: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
  • Normalized Range: 0.0 to 1.0 (divided by 5.0 during preprocessing)
  • Average Sentence Length: ~10-15 tokens per sentence

Data Format

Each training example consists of:

  • Sentence 1: Turkish sentence (6-30 tokens)
  • Sentence 2: Turkish sentence (6-26 tokens)
  • Similarity Score: Float value 0.0-1.0 (normalized from 0-5 scale)
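
The normalization described above is a single division during preprocessing:

```python
# Raw STS-B labels on the 0-5 scale are divided by 5.0 to give
# regression targets in [0, 1] for the similarity head.
raw_labels = [0.0, 1.0, 2.5, 5.0]
targets = [s / 5.0 for s in raw_labels]  # [0.0, 0.2, 0.5, 1.0]
```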

Sample Data

| Sentence 1 | Sentence 2 | Score |
|------------|------------|-------|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |

Capabilities

This model is specifically optimized for:

  • Semantic Similarity Scoring: Predicting similarity scores between Turkish sentence pairs
  • Paraphrase Detection: Identifying paraphrases and semantically equivalent sentences
  • Duplicate Detection: Finding duplicate or near-duplicate Turkish content
  • Question-Answer Matching: Matching questions with semantically similar answers
  • Document Similarity: Comparing semantic similarity of Turkish documents
  • Sentence Clustering: Grouping semantically similar Turkish sentences
  • Textual Entailment: Understanding semantic relationships between sentences

Usage

Installation

pip install -U sentence-transformers

Semantic Similarity Scoring

from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)
    
    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()

Batch Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)

Finding Most Similar Sentences

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")

Training Details

Complete Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
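
As a rough sketch, these hyperparameters map onto the sentence-transformers v3+ trainer configuration as follows. This is an assumed reconstruction, not the original training script; field names are inherited from `transformers.TrainingArguments`:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Hypothetical reconstruction from the hyperparameter table above.
args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-stsb-turkish",   # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-05,
    weight_decay=0.01,
    warmup_steps=89,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    num_train_epochs=5,
    eval_strategy="steps",
    eval_steps=15,
    save_steps=45,
    save_total_limit=3,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # lower is better
    greater_is_better=False,
    optim="adamw_torch_fused",
)
```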

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.1.1
  • PyTorch: 2.8.0+cu128
  • Transformers: 4.57.0
  • CUDA: 12.8
  • Accelerate: 1.10.1
  • Datasets: 4.2.0
  • Tokenizers: 0.22.1

Use Cases

  • Chatbot Response Matching: Find the most semantically similar pre-defined response for user queries
  • FAQ Search: Match user questions to the most relevant FAQ entries
  • Content Recommendation: Recommend articles or documents with similar semantic content
  • Plagiarism Detection: Identify semantically similar text for academic integrity checks
  • Customer Support: Match support tickets to similar previously resolved issues
  • Document Clustering: Group documents by semantic similarity for organization
  • Paraphrase Mining: Automatically detect paraphrases in large Turkish text corpora
  • Semantic Search: Build semantic search engines for Turkish content
  • Question Answering: Match questions to semantically relevant answer candidates
  • Text Summarization: Identify redundant sentences for summary generation

Citation

AnglELoss

@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Base Model (BGE-M3)

@misc{bge-m3,
    title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year={2024},
    eprint={2402.03216},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Dataset

@misc{stsb-deepl-tr,
    title={Turkish STS-B Dataset (DeepL Translation)},
    author={NewMind AI},
    year={2024},
    url={https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}

License

This model is licensed under the Apache 2.0 License.

Acknowledgments

  • Base Model: BAAI/bge-m3
  • Training Infrastructure: MareNostrum 5 Supercomputer (Barcelona Supercomputing Center)
  • Framework: Sentence Transformers by UKP Lab
  • Dataset: newmindai/stsb-deepl-tr
  • Loss Function: AnglELoss (Angle-optimized Embeddings)
  • Training Approach: Single-task fine-tuning on Turkish STS-B