TR-MTEB committed on Jan 29

Commit

927dc73

1 Parent(s): 54360b9

Update README.md

Files changed (1)

README.md +94 -4

README.md CHANGED

@@ -1,10 +1,100 @@
 ---
-title: README
-emoji: 📊
 colorFrom: purple
 colorTo: blue
 sdk: static
-pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

 ---
+title: TR-MTEB
+emoji: 🇹🇷📊
 colorFrom: purple
 colorTo: blue
 sdk: static
+pinned: true
 ---
+# TR-MTEB: Turkish Massive Text Embedding Benchmark
+Welcome to the official Hugging Face organization for **TR-MTEB**,
+the first large-scale and task-diverse benchmark for evaluating **Turkish sentence embedding models**.
+---
+## 📌 Paper
+**TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations**
+Mehmet Selman Baysan, Tunga Gungor
+*Findings of EMNLP 2025*
+- 📄 ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/
+- DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471
+> We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.
+---
+## 🔍 Benchmark Overview
+TR-MTEB provides evaluation across **6 major embedding task categories**:
+- **Classification**
+- **Clustering**
+- **Pair Classification**
+- **Retrieval**
+- **Bitext Mining**
+- **Semantic Textual Similarity (STS)**
+📊 Total datasets included: **26**
+🌍 Combination of native Turkish + high-quality translated datasets
+---
+## 🧠 Turkish Embedding Models
+To complement the benchmark, we also release Turkish-specific embedding models trained on:
+- **34.2 million weakly supervised Turkish sentence pairs**
+- Contrastive pretraining + supervised fine-tuning
+These models achieve strong performance and significantly outperform monolingual baselines.
+---
+## 📂 Released Resources
+This organization hosts:
+✅ Benchmark datasets
+✅ Evaluation pipeline
+✅ Turkish embedding model suite
+✅ Training corpus and scripts (where applicable)
+All resources are released publicly to support research in:
+- Turkish NLP
+- Low-resource language embeddings
+- Multilingual benchmark development
+---
+## 🌟 Citation
+If you use TR-MTEB in your work, please cite:
+```bibtex
+@inproceedings{baysan-gungor-2025-tr,
+  title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
+  author = "Baysan, Mehmet Selman and Gungor, Tunga",
+  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
+  month = nov,
+  year = "2025",
+  address = "Suzhou, China",
+  publisher = "Association for Computational Linguistics",
+  url = "https://aclanthology.org/2025.findings-emnlp.471/",
+  doi = "10.18653/v1/2025.findings-emnlp.471",
+  pages = "8867--8887"
+}
+```
+---
+## 🤝 Contact & Contributions
+We welcome contributions, new datasets, and collaborations.
+- Author: Mehmet Selman Baysan
+- Organization: TR-MTEB Project
+Feel free to open issues or discussions on Hugging Face.
+🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.