---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

**Turkish morphological tokenizer with a TR-MMLU world record of 92.64%**

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).

## Model Details

| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | zemberek-python |

---

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

---

### With Transformers (`AutoTokenizer`)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Ethosoft/NedoTurkishTokenizer", trust_remote_code=True
)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])       # hash-stable int IDs
print(out["attention_mask"])  # [1, 1, 1, ...]
print(out["token_type_ids"])  # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

**Batch tokenization:**

```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```

**Direct morphological tokenization:**

```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):
        print(f" [{t['_canonical']}]", end="")
    if t.get("_compound"):
        print(f" compound={t['_parts']}", end="")
    if t.get("_expansion"):
        print(f" -> {t['_expansion']}", end="")
    print()
```

---

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel, all CPUs)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
```

---

### Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `` | ROOT | 0 | ALL CAPS marker (Fix 1) |
| ` istanbul` | ROOT | 0 | Lowercased |
| `'` | PUNCT | 0 | Fixed boundary |
| `da` | SUFFIX | 1 | `-LOC` [LOC] |
| ` meeting` | FOREIGN | 0 | TDK lookup (Fix 7) |
| `e` | SUFFIX | 1 | `-DAT` [DAT] |
| ` katılmak` | ROOT | 0 | Root corrected (Fix 4) |
| `lama` | SUFFIX | 1 | `-VN+NEG` |
| `d` | SUFFIX | 2 | `-PAST` |
| `ım` | SUFFIX | 3 | `-1SG` [1SG] |

---

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string; a leading space means word-initial |
| `token_type` | `str` | Morphological type (`ROOT`, `SUFFIX`, `FOREIGN`, `PUNCT`, ...) |
| `morph_pos` | `int` | Position within the word: `0` = root/initial, `1` = first suffix, `2` = second suffix, ... |

### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"`/`"ler"` -> `"PL"`, `"dan"`/`"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` if the root was detected as foreign via TDK lookup |
| `_caps` | `True` if the word was originally ALL CAPS |
| `_domain` | `True` if the word is from the medical, sports, or tourism domain vocabulary |
| `_compound` | `True` if the word is a compound (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`, ... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when parsed as a verb) |
| `_disambiguated` | `True` if context disambiguation was applied (`"yüz"`, `"gelir"`, ...) |
| `_root_corrected` | `True` if a phonetic root correction was applied: `"gök"` -> `"göğüs"` |

---

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at the apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified as SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation and correction |
| 5 | Punctuation counted as BPE | Classified as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ word medical, sports, and tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN root |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` get different IDs for the same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`, ...) |
| 10 | `başbakan` as a single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = "hundred", "face", or "swim"? | Zemberek sentence-level context disambiguation |

---

## License

MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)
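---

## Appendix: Sketch of Allomorph Canonicalization (Fix 9)

Fix 9 above can be sketched in plain Python. The `CANONICAL` mapping and the `canonicalize` helper below are illustrative assumptions built only from the pairs documented in this card (`lar`/`ler` -> `PL`, `dan`/`den` -> `ABL`, `da` -> `LOC`, `e` -> `DAT`); they are not the tokenizer's actual table or API.

```python
# Illustrative subset of an allomorph table: Turkish vowel harmony
# produces several surface forms of the same morpheme, and Fix 9
# maps them all to one canonical label so they share a single ID.
CANONICAL = {
    "lar": "PL", "ler": "PL",        # plural
    "dan": "ABL", "den": "ABL",      # ablative
    "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC",        # locative
    "ta": "LOC", "te": "LOC",
    "a": "DAT", "e": "DAT",          # dative
}


def canonicalize(token: str, token_type: str) -> str:
    """Return the canonical morpheme label for a SUFFIX token.

    Non-suffix tokens, and suffixes outside the table, pass
    through unchanged.
    """
    if token_type != "SUFFIX":
        return token
    return CANONICAL.get(token.strip(), token)


print(canonicalize("lar", "SUFFIX"))    # PL
print(canonicalize("ler", "SUFFIX"))    # PL
print(canonicalize("den", "SUFFIX"))    # ABL
print(canonicalize("kitap", "ROOT"))    # kitap (unchanged)
```

Collapsing vowel-harmony variants into one canonical ID is what lets `lar` and `ler` share a single embedding row instead of being learned as unrelated tokens.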