---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---

# TurkTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92%**

TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) that align with Turkish grammar.

## Installation

```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```

**Java is required** (for Zemberek morphological analysis):

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
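
Whichever install route you take, you can confirm a Java runtime is visible on `PATH` before installing the package (a minimal check; Zemberek needs Java at run time):

```shell
# Check for a Java runtime; print its version if present
if command -v java >/dev/null 2>&1; then
  java -version
else
  echo "No Java runtime found - install one with the commands above" >&2
fi
```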

## Quick Start

```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")

for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Output:
```
<uppercase_word>  ROOT     0
 istanbul         ROOT     0
da                SUFFIX   1
 meeting          FOREIGN  0
e                 SUFFIX   1
 katılama         ROOT     0
dı                SUFFIX   1
m                 SUFFIX   2
```

## Output Fields

Each token is a dict with the following guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
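
The leading-space convention makes it straightforward to rebuild an approximate surface form. A minimal sketch (the token dicts below are hand-written to match the Quick Start output shape; apostrophes are not recoverable this way):

```python
def detokenize(tokens):
    """Rebuild an approximate surface string: a leading space marks a word-initial token."""
    return "".join(t["token"] for t in tokens).lstrip()

tokens = [
    {"token": " istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
]
print(detokenize(tokens))  # istanbulda meetinge
```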

### Token Types

| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |
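
Because every token carries a `token_type`, simple corpus-level breakdowns need no extra tooling. A sketch using standard-library `Counter` (the sample dicts are illustrative, not tokenizer output):

```python
from collections import Counter

def token_type_histogram(tokens):
    """Count occurrences of each token_type in a token list."""
    return Counter(t["token_type"] for t in tokens)

sample = [
    {"token": " istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
]
print(token_type_histogram(sample))  # Counter({'SUFFIX': 2, 'ROOT': 1, 'FOREIGN': 1})
```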

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
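
Since these fields are optional, read them with `dict.get` rather than indexing. A small sketch (the example token is hand-written for illustration):

```python
def describe_token(t):
    """Render a token with whichever optional metadata fields are present."""
    parts = [t["token"].strip(), t["token_type"]]
    if t.get("_canonical"):
        parts.append(f"canonical={t['_canonical']}")
    if t.get("_lemma"):
        parts.append(f"lemma={t['_lemma']}")
    if t.get("_foreign"):
        parts.append("foreign")
    return " | ".join(parts)

tok = {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1, "_canonical": "PL"}
print(describe_token(tok))  # lar | SUFFIX | canonical=PL
```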

## Batch Tokenization

```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```

## Statistics

```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
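
For intuition, a coverage percentage like `tr_pct` can be sketched as the share of tokens whose type counts as native. Note this is an assumption about which types count (the `TR_TYPES` set below is illustrative); the library's own `stats()` is authoritative:

```python
# Assumed set of token types counted toward TR coverage (illustrative only)
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT", "NUM", "DATE", "UNIT"}

def tr_coverage(tokens):
    """Percentage of tokens whose type falls in TR_TYPES (sketch, not the library's metric)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t["token_type"] in TR_TYPES)
    return round(100 * hits / len(tokens), 1)

sample = [
    {"token": " türk", "token_type": "ROOT"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": "e", "token_type": "SUFFIX"},
    {"token": ".", "token_type": "PUNCT"},
]
print(tr_coverage(sample))  # 75.0
```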

## Morphological Fixes Applied

1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
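
Allomorph canonicalization (fix 9) amounts to mapping vowel/consonant-harmony variants of a suffix onto one canonical morpheme ID. A minimal sketch; the lookup table below is illustrative, not the library's internal data:

```python
# Illustrative allomorph -> canonical morpheme ID table (not the library's actual data)
ALLOMORPHS = {
    "lar": "PL", "ler": "PL",
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC", "ta": "LOC", "te": "LOC",
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical ID, passing unknown strings through."""
    return ALLOMORPHS.get(suffix, suffix)

print(canonicalize("ler"))  # PL
print(canonicalize("dan"))  # ABL
```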

## Benchmark

| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |

## License

MIT