TR-MTEB committed on
Commit
927dc73
·
verified ·
1 Parent(s): 54360b9

Update README.md

Files changed (1)
  1. README.md +94 -4
README.md CHANGED
@@ -1,10 +1,100 @@
  ---
- title: README
- emoji: 📊
  colorFrom: purple
  colorTo: blue
  sdk: static
- pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card.
 
  ---
+ title: TR-MTEB
+ emoji: 🇹🇷📊
  colorFrom: purple
  colorTo: blue
  sdk: static
+ pinned: true
  ---
 
+ # TR-MTEB: Turkish Massive Text Embedding Benchmark
+
+ Welcome to the official Hugging Face organization for **TR-MTEB**,
+ the first large-scale and task-diverse benchmark for evaluating **Turkish sentence embedding models**.
+
+ ---
+
+ ## 📌 Paper
+
+ **TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations**
+ Mehmet Selman Baysan, Tunga Gungor
+ *Findings of EMNLP 2025*
+
+ - 📄 ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/
+ - DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471
+
+ > We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.
+
+ ---
+
+ ## 🔍 Benchmark Overview
+
+ TR-MTEB provides evaluation across **6 major embedding task categories**:
+
+ - **Classification**
+ - **Clustering**
+ - **Pair Classification**
+ - **Retrieval**
+ - **Bitext Mining**
+ - **Semantic Textual Similarity (STS)**
+
+ 📊 Total datasets included: **26**
+ 🌍 Combination of native Turkish + high-quality translated datasets
+
+ ---
+
+ ## 🧠 Turkish Embedding Models
+
+ To complement the benchmark, we also release Turkish-specific embedding models trained on:
+
+ - **34.2 million weakly supervised Turkish sentence pairs**
+ - Contrastive pretraining + supervised fine-tuning
+
+ These models achieve strong performance and significantly outperform monolingual baselines.
+
+ ---
+
+ ## 📂 Released Resources
+
+ This organization hosts:
+
+ ✅ Benchmark datasets
+ ✅ Evaluation pipeline
+ ✅ Turkish embedding model suite
+ ✅ Training corpus and scripts (where applicable)
+
+ All resources are released publicly to support research in:
+
+ - Turkish NLP
+ - Low-resource language embeddings
+ - Multilingual benchmark development
+
+ ---
+
+ ## 🌟 Citation
+
+ If you use TR-MTEB in your work, please cite:
+
+ ```bibtex
+ @inproceedings{baysan-gungor-2025-tr,
+   title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
+   author = "Baysan, Mehmet Selman and Gungor, Tunga",
+   booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
+   month = nov,
+   year = "2025",
+   address = "Suzhou, China",
+   publisher = "Association for Computational Linguistics",
+   url = "https://aclanthology.org/2025.findings-emnlp.471/",
+   doi = "10.18653/v1/2025.findings-emnlp.471",
+   pages = "8867--8887"
+ }
+ ```
+ ## 🤝 Contact & Contributions
+
+ We welcome contributions, new datasets, and collaborations.
+
+ Author: Mehmet Selman Baysan
+
+ Organization: TR-MTEB Project
+
+ Feel free to open issues or discussions on Hugging Face.
+
+ 🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.