---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---
# TurkTokenizer
**Turkish morphological tokenizer — TR-MMLU world record 92%**
TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.
## Installation
```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```
**Java is required** (for Zemberek morphological analysis):
| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
## Quick Start
```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```
Output:
```
<uppercase_word> ROOT 0
istanbul ROOT 0
da SUFFIX 1
meeting FOREIGN 0
e SUFFIX 1
katılama ROOT 0
dı SUFFIX 1
m SUFFIX 2
```
## Output Fields
Each token is a dict with the following guaranteed fields:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
### Token Types
| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |
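The three guaranteed fields are enough to regroup the token stream back into surface words: a token with `morph_pos == 0` starts a new word, and each suffix attaches to the preceding root. The sketch below runs on the Quick Start output shown above (the `<uppercase_word>` capitalization marker is omitted for simplicity); the `regroup` helper is illustrative, not part of the library API.

```python
# Token dicts as produced in the Quick Start example above, using only
# the three guaranteed fields.
tokens = [
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katılama", "token_type": "ROOT", "morph_pos": 0},
    {"token": "dı", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "m", "token_type": "SUFFIX", "morph_pos": 2},
]

def regroup(tokens):
    """Rejoin morpheme tokens into surface words: morph_pos == 0 opens
    a new word, any other value appends to the previous one."""
    words = []
    for t in tokens:
        if t["morph_pos"] == 0:
            words.append(t["token"])
        else:
            words[-1] += t["token"]
    return words

print(regroup(tokens))  # ['istanbulda', 'meetinge', 'katılamadım']
```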
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
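Since these fields are optional, read them with `dict.get` rather than indexing (this sketch assumes absent fields are simply omitted from the dict; the `describe` helper is illustrative, not a library function):

```python
def describe(token):
    """Format a token using guaranteed fields plus any optional
    metadata that happens to be present."""
    parts = [token["token"], token["token_type"]]
    label = token.get("_suffix_label")  # may be absent
    if label:
        parts.append(label)
    lemma = token.get("_lemma")  # may be absent
    if lemma:
        parts.append(f"lemma={lemma}")
    return " ".join(parts)

t = {"token": "ler", "token_type": "SUFFIX", "morph_pos": 1,
     "_canonical": "PL", "_suffix_label": "-PL"}
print(describe(t))  # ler SUFFIX -PL
```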
## Batch Tokenization
```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```
## Statistics
```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
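As a rough sketch of what a coverage metric like `tr_pct` might compute (assumed semantics, not the library's exact formula): the share of tokens classified as Turkish units. Per fix 5 in the list below, `PUNCT` counts toward TR coverage alongside `ROOT` and `SUFFIX`.

```python
# Assumption: "TR coverage" = fraction of tokens whose type counts as
# a Turkish unit. The exact set of counted types is an assumption here.
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT"}

def tr_pct(tokens):
    """Percentage of tokens whose token_type is in TR_TYPES."""
    tr = sum(t["token_type"] in TR_TYPES for t in tokens)
    return round(100 * tr / len(tokens), 1)

sample = [
    {"token": "türk", "token_type": "ROOT"},
    {"token": "zoom", "token_type": "FOREIGN"},
    {"token": "da", "token_type": "SUFFIX"},
    {"token": ".", "token_type": "PUNCT"},
]
print(tr_pct(sample))  # 75.0
```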
## Morphological Fixes Applied
1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
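Allomorph canonicalization (fix 9) can be sketched as a lookup from surface suffix variants to one canonical morpheme ID. This is an illustrative table only; the library's real mapping covers far more suffixes and is driven by vowel-harmony and consonant-assimilation rules rather than a flat dict.

```python
# Illustrative allomorph table: several vowel/consonant-harmony variants
# of a suffix map to a single canonical ID, as in fix 9 above.
CANONICAL = {
    "lar": "PL", "ler": "PL",                      # plural
    "dan": "ABL", "den": "ABL",                    # ablative
    "tan": "ABL", "ten": "ABL",                    # ablative after voiceless consonants
    "da": "LOC", "de": "LOC",                      # locative
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical morpheme ID, or return it
    unchanged if it is not in the table."""
    return CANONICAL.get(suffix, suffix)

print(canonicalize("ler"))  # PL
print(canonicalize("ten"))  # ABL
```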
## Benchmark
| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |
## License
MIT