Update README.md
README.md
|
@@ -1,10 +1,100 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji: π
|
| 4 |
colorFrom: purple
|
| 5 |
colorTo: blue
|
| 6 |
sdk: static
|
| 7 |
-
pinned:
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
| 1 |
---
|
| 2 |
+
title: TR-MTEB
|
| 3 |
+
emoji: 🇹🇷
|
| 4 |
colorFrom: purple
|
| 5 |
colorTo: blue
|
| 6 |
sdk: static
|
| 7 |
+
pinned: true
|
| 8 |
---
|
| 9 |
|
| 10 |
+
# TR-MTEB: Turkish Massive Text Embedding Benchmark
|
| 11 |
+
|
| 12 |
+
Welcome to the official Hugging Face organization for **TR-MTEB**,
|
| 13 |
+
the first large-scale and task-diverse benchmark for evaluating **Turkish sentence embedding models**.
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## Paper
|
| 18 |
+
|
| 19 |
+
**TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations**
|
| 20 |
+
Mehmet Selman Baysan, Tunga Gungor
|
| 21 |
+
*Findings of EMNLP 2025*
|
| 22 |
+
|
| 23 |
+
- ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/
|
| 24 |
+
- DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471
|
| 25 |
+
|
| 26 |
+
> We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.
|
| 27 |
+
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
## Benchmark Overview
|
| 31 |
+
|
| 32 |
+
TR-MTEB provides evaluation across **6 major embedding task categories**:
|
| 33 |
+
|
| 34 |
+
- **Classification**
|
| 35 |
+
- **Clustering**
|
| 36 |
+
- **Pair Classification**
|
| 37 |
+
- **Retrieval**
|
| 38 |
+
- **Bitext Mining**
|
| 39 |
+
- **Semantic Textual Similarity (STS)**
|
| 40 |
+
|
| 41 |
+
- Total datasets included: **26**
|
| 42 |
+
- A combination of native Turkish and high-quality translated datasets
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## 🧠 Turkish Embedding Models
|
| 47 |
+
|
| 48 |
+
To complement the benchmark, we also release Turkish-specific embedding models trained using:
|
| 49 |
+
|
| 50 |
+
- **34.2 million weakly supervised Turkish sentence pairs**
|
| 51 |
+
- Contrastive pretraining + supervised fine-tuning
|
| 52 |
+
|
| 53 |
+
These models achieve strong performance and significantly outperform monolingual baselines.
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
## Released Resources
|
| 58 |
+
|
| 59 |
+
This organization hosts:
|
| 60 |
+
|
| 61 |
+
- ✅ Benchmark datasets
|
| 62 |
+
- ✅ Evaluation pipeline
|
| 63 |
+
- ✅ Turkish embedding model suite
|
| 64 |
+
- ✅ Training corpus and scripts (where applicable)
|
| 65 |
+
|
| 66 |
+
All resources are released publicly to support research in:
|
| 67 |
+
|
| 68 |
+
- Turkish NLP
|
| 69 |
+
- Low-resource language embeddings
|
| 70 |
+
- Multilingual benchmark development
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
## Citation
|
| 75 |
+
|
| 76 |
+
If you use TR-MTEB in your work, please cite:
|
| 77 |
+
|
| 78 |
+
```bibtex
|
| 79 |
+
@inproceedings{baysan-gungor-2025-tr,
|
| 80 |
+
title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
|
| 81 |
+
author = "Baysan, Mehmet Selman and Gungor, Tunga",
|
| 82 |
+
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
|
| 83 |
+
month = nov,
|
| 84 |
+
year = "2025",
|
| 85 |
+
address = "Suzhou, China",
|
| 86 |
+
publisher = "Association for Computational Linguistics",
|
| 87 |
+
url = "https://aclanthology.org/2025.findings-emnlp.471/",
|
| 88 |
+
doi = "10.18653/v1/2025.findings-emnlp.471",
|
| 89 |
+
pages = "8867--8887"
|
| 90 |
+
}
```
|
| 91 |
+
## Contact & Contributions
|
| 92 |
+
We welcome contributions, new datasets, and collaborations.
|
| 93 |
+
|
| 94 |
+
Author: Mehmet Selman Baysan
|
| 95 |
+
|
| 96 |
+
Organization: TR-MTEB Project
|
| 97 |
+
|
| 98 |
+
Feel free to open issues or discussions on Hugging Face.
|
| 99 |
+
|
| 100 |
+
🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.
|