---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92.64%**

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar. Morphological analysis is powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).

## Model Details

| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | zemberek-python |

---

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

---

### With Transformers (`AutoTokenizer`)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])            # hash-stable int IDs
print(out["attention_mask"])       # [1, 1, 1, ...]
print(out["token_type_ids"])       # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"]  -> list of lists
```

**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):   print(f"  [{t['_canonical']}]", end="")
    if t.get("_compound"):    print(f"  compound={t['_parts']}", end="")
    if t.get("_expansion"):   print(f"  -> {t['_expansion']}", end="")
    print()
```

---

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel across worker processes)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
```
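
For reference, a coverage metric in the spirit of `tr_pct` can be sketched from token types alone. The formula below is an assumption for illustration (the library's actual definition may weight types differently), and the token dicts are hand-written in the documented output format:

```python
from collections import Counter

# Hand-written token dicts in the documented output format; not produced
# by the library. The weighting below is an assumption for illustration.
tokens = [
    {"token": " kitap",   "token_type": "ROOT"},
    {"token": "lar",      "token_type": "SUFFIX"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": ".",        "token_type": "PUNCT"},
]

counts = Counter(t["token_type"] for t in tokens)
covered = counts["ROOT"] + counts["SUFFIX"]   # handled by Turkish morphology
analyzed = len(tokens) - counts["PUNCT"]      # ignore punctuation
tr_pct = round(100 * covered / analyzed, 1)
print(tr_pct)  # 66.7
```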

---

### Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker (Fix 1) |
| ` istanbul` | ROOT | 0 | lowercased |
| `'` | PUNCT | 0 | Fixed boundary |
| `da` | SUFFIX | 1 | `-LOC` [LOC] |
| ` meeting` | FOREIGN | 0 | TDK lookup (Fix 7) |
| `e` | SUFFIX | 1 | `-DAT` [DAT] |
| ` katılmak` | ROOT | 0 | Root corrected (Fix 4) |
| `lama` | SUFFIX | 1 | `-VN+NEG` |
| `d` | SUFFIX | 2 | `-PAST` |
| `ım` | SUFFIX | 3 | `-1SG` [1SG] |

---

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — leading space means word-initial |
| `token_type` | `str` | Morphological type (ROOT, SUFFIX, FOREIGN, PUNCT, etc.) |
| `morph_pos` | `int` | Position within word: `0`=root/initial, `1`=1st suffix, `2`=2nd suffix... |
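
Because a leading space marks a word boundary, detokenization is a plain string join. A minimal sketch using hand-written token dicts in the format above (special marker tokens such as `<uppercase_word>` are not handled here):

```python
def detokenize(tokens):
    """Concatenate token strings; leading spaces restore word boundaries."""
    return "".join(t["token"] for t in tokens).lstrip()

# Hand-written token dicts for "kitaplarda" ("in the books")
tokens = [
    {"token": " kitap", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "lar",    "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "da",     "token_type": "SUFFIX", "morph_pos": 2},
]
print(detokenize(tokens))  # kitaplarda
```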

### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected by TDK lookup |
| `_caps` | `True` — originally ALL CAPS word |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |

---

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
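
Fix 9 (allomorph canonicalization) can be illustrated with a tiny lookup table. This is a hand-rolled sketch, not the library's actual mapping or ID scheme; real Turkish suffix selection also involves vowel harmony and consonant assimilation:

```python
# Map surface allomorphs of the same morpheme to one canonical key,
# so that e.g. "lar" and "ler" share a single vocabulary ID.
ALLOMORPHS = {
    "lar": "PL",  "ler": "PL",                               # plural
    "da": "LOC",  "de": "LOC",  "ta": "LOC",  "te": "LOC",   # locative
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",  # ablative
}

def hash_id(key: str) -> int:
    # Toy hash-stable ID (the real tokenizer's scheme is unspecified here).
    return sum(ord(c) * 31**i for i, c in enumerate(key)) % 10_000

def canonical_id(suffix: str) -> int:
    """Stable ID derived from the canonical morpheme, not the surface form."""
    return hash_id(ALLOMORPHS.get(suffix, suffix))

print(canonical_id("lar") == canonical_id("ler"))  # True
print(canonical_id("da") == canonical_id("te"))    # True
```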

---

## License

MIT © [Ethosoft](https://huggingface.co/Ethosoft)