---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---
# NedoTurkishTokenizer
**Turkish morphological tokenizer — TR-MMLU world record 92.64%**
NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
## Model Details
| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | zemberek-python |
---
## Use This Model
### Installation
```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```
---
### With Transformers (`AutoTokenizer`)
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)
out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"]) # hash-stable int IDs
print(out["attention_mask"]) # [1, 1, 1, ...]
print(out["token_type_ids"]) # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
for t in out["morphological_tokens"]:
print(t["token"], t["token_type"], t["morph_pos"])
```
**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```
**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
if t.get("_canonical"): print(f" [{t['_canonical']}]", end="")
if t.get("_compound"): print(f" compound={t['_parts']}", end="")
if t.get("_expansion"): print(f" -> {t['_expansion']}", end="")
print()
```
---
### Standalone (without Transformers)
```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer
tok = NedoTurkishTokenizer()
# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
print(t["token"], t["token_type"], t["morph_pos"])
# Batch (parallel, all CPUs)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)
# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
```
---
### Example Output
Input: `"İSTANBUL'da meeting'e katılamadım"`
| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker (Fix 1) |
| ` istanbul` | ROOT | 0 | lowercased |
| `'` | PUNCT | 0 | Fixed boundary |
| `da` | SUFFIX | 1 | `-LOC` [LOC] |
| ` meeting` | FOREIGN | 0 | TDK lookup (Fix 7) |
| `e` | SUFFIX | 1 | `-DAT` [DAT] |
| ` katılmak` | ROOT | 0 | Root corrected (Fix 4) |
| `lama` | SUFFIX | 1 | `-VN+NEG` |
| `d` | SUFFIX | 2 | `-PAST` |
| `ım` | SUFFIX | 3 | `-1SG` [1SG] |
---
## Output Fields
Every token dict contains:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — leading space means word-initial |
| `token_type` | `str` | Morphological type (ROOT, SUFFIX, FOREIGN, PUNCT, etc.) |
| `morph_pos` | `int` | Position within word: `0`=root/initial, `1`=1st suffix, `2`=2nd suffix... |
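Because a leading space marks a word boundary, detokenization is a plain string concatenation. A minimal sketch of this convention, using hand-written token dicts shaped like the tokenizer's output (not produced by the library):

```python
# Illustrative token dicts shaped like the tokenizer's output
# (hand-written here; not produced by the library itself).
tokens = [
    {"token": " kitap", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "lar",    "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "da",     "token_type": "SUFFIX", "morph_pos": 2},
    {"token": " yok",   "token_type": "ROOT",   "morph_pos": 0},
    {"token": ".",      "token_type": "PUNCT",  "morph_pos": 0},
]

def detokenize(tokens):
    """Concatenate token strings; a leading space marks a word boundary."""
    return "".join(t["token"] for t in tokens).strip()

print(detokenize(tokens))  # kitaplarda yok.
```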
### Token Types
| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | 🙂 |
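Coverage figures like the `TR%` reported by `tok.stats` can be approximated from token types alone. The counting logic below is my own sketch, not the library's implementation; here `TR%` is taken to mean the share of word-like tokens that are native Turkish morphemes (ROOT or SUFFIX):

```python
from collections import Counter

# Hand-written sample token list (not produced by the library).
tokens = [
    {"token": " kitap",   "token_type": "ROOT"},
    {"token": "lar",      "token_type": "SUFFIX"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": "xq",       "token_type": "BPE"},
    {"token": ".",        "token_type": "PUNCT"},
]

# Token types that correspond to word material (punctuation etc. excluded).
WORDLIKE = {"ROOT", "SUFFIX", "FOREIGN", "BPE"}

def tr_coverage(tokens):
    """Percentage of word-like tokens that are native Turkish morphemes."""
    counts = Counter(t["token_type"] for t in tokens)
    wordlike = sum(counts[k] for k in WORDLIKE)
    turkish = counts["ROOT"] + counts["SUFFIX"]
    return 100.0 * turkish / wordlike if wordlike else 0.0

print(f"TR%: {tr_coverage(tokens):.1f}")  # TR%: 50.0
```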
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected by TDK lookup |
| `_caps` | `True` — originally ALL CAPS word |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |
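The `_canonical` field exists so that allomorphs of one morpheme (`lar`/`ler`, `dan`/`den`) can share a single vocabulary ID. A minimal sketch of ID assignment keyed on the canonical form when present; the vocabulary and IDs here are illustrative, not the tokenizer's actual scheme:

```python
# Assign stable IDs, collapsing allomorphs onto their canonical morpheme.
# The vocabulary built here is illustrative, not the tokenizer's own.
vocab = {}

def token_id(tok):
    # Prefer the canonical morpheme tag when the tokenizer supplies one.
    key = tok.get("_canonical") or tok["token"]
    if key not in vocab:
        vocab[key] = len(vocab)
    return vocab[key]

a = token_id({"token": "lar", "_canonical": "PL"})
b = token_id({"token": "ler", "_canonical": "PL"})
print(a == b)  # True: both plural allomorphs map to the same ID
```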
---
## How It Works
NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
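Fix 1 can be illustrated in a few lines of pure Python: lowercase each ALL-CAPS word before tokenization and emit a marker token so the casing can be restored later. This is a hedged sketch of the idea, not the library's code; note the explicit handling of Turkish dotted/dotless I, which plain `str.lower()` gets wrong:

```python
import re

UPPER_MARKER = "<uppercase_word>"

def tr_lower(s):
    """Turkish-aware lowercasing: İ -> i, I -> ı, then default lower()."""
    return s.replace("İ", "i").replace("I", "ı").lower()

def normalize_caps(text):
    """Replace each ALL-CAPS word with a marker plus its lowercase form."""
    def repl(m):
        return f"{UPPER_MARKER} {tr_lower(m.group(0))}"
    # Runs of 2+ uppercase letters, including Turkish İ, Ü, Ö, Ş, Ç, Ğ.
    return re.sub(r"\b[A-ZİÜÖŞÇĞ]{2,}\b", repl, text)

print(normalize_caps("İSTANBUL'da toplantı var"))
# <uppercase_word> istanbul'da toplantı var
```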
---
## License
MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)