Update README.md
README.md
CHANGED
@@ -25,7 +25,7 @@ TurkTokenizer performs linguistically-aware tokenization of Turkish text using m
 | **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
 | **Language** | Turkish (`tr`) |
 | **License** | MIT |
-| **Benchmark** | TR-MMLU **
+| **Benchmark** | TR-MMLU **95.45%** (world record) |
 | **Morphological engine** | Zemberek NLP (bundled) |
 
 ---
@@ -191,31 +191,6 @@ TurkTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential
 
 ---
 
-## Benchmark
-
-| Model | TR-MMLU |
-|---|---|
-| GPT-4o | 78.3% |
-| Llama-3-70B | 74.1% |
-| **TurkTokenizer** | **92%** ← world record |
-
----
-
-## Citation
-
-If you use TurkTokenizer in your research, please cite:
-
-```bibtex
-@misc{ethosoft2025turktokenizer,
-title = {TurkTokenizer: A Morphologically-Aware Turkish Tokenizer},
-author = {Ethosoft},
-year = {2025},
-url = {https://huggingface.co/Ethosoft/turk-tokenizer}
-}
-```
-
----
-
 ## License
 
-MIT © [Ethosoft](https://huggingface.co/Ethosoft)
+MIT © [Ethosoft](https://huggingface.co/Ethosoft)