---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92.64%**

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar. Morphological analysis is powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).

## Model Details

| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | zemberek-python |

---

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

---

### With Transformers (`AutoTokenizer`)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])            # hash-stable int IDs
print(out["attention_mask"])       # [1, 1, 1, ...]
print(out["token_type_ids"])       # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"]  -> list of lists
```

**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):   print(f"  [{t['_canonical']}]", end="")
    if t.get("_compound"):    print(f"  compound={t['_parts']}", end="")
    if t.get("_expansion"):   print(f"  -> {t['_expansion']}", end="")
    print()
```

---

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel across worker processes)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
```
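
For reference, a coverage metric in the spirit of `tr_pct` can be sketched from token types alone. The formula below is an assumption for illustration (the library's actual definition may weight types differently), and the token dicts are hand-written in the documented output format:

```python
from collections import Counter

# Hand-written token dicts in the documented output format; not produced
# by the library. The weighting below is an assumption for illustration.
tokens = [
    {"token": " kitap",   "token_type": "ROOT"},
    {"token": "lar",      "token_type": "SUFFIX"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": ".",        "token_type": "PUNCT"},
]

counts = Counter(t["token_type"] for t in tokens)
covered = counts["ROOT"] + counts["SUFFIX"]   # handled by Turkish morphology
analyzed = len(tokens) - counts["PUNCT"]      # ignore punctuation
tr_pct = round(100 * covered / analyzed, 1)
print(tr_pct)  # 66.7
```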

---

### Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker (Fix 1) |
| ` istanbul` | ROOT | 0 | lowercased |
| `'` | PUNCT | 0 | Fixed boundary |
| `da` | SUFFIX | 1 | `-LOC` [LOC] |
| ` meeting` | FOREIGN | 0 | TDK lookup (Fix 7) |
| `e` | SUFFIX | 1 | `-DAT` [DAT] |
| ` katılmak` | ROOT | 0 | Root corrected (Fix 4) |
| `lama` | SUFFIX | 1 | `-VN+NEG` |
| `d` | SUFFIX | 2 | `-PAST` |
| `ım` | SUFFIX | 3 | `-1SG` [1SG] |

---

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — leading space means word-initial |
| `token_type` | `str` | Morphological type (ROOT, SUFFIX, FOREIGN, PUNCT, etc.) |
| `morph_pos` | `int` | Position within word: `0`=root/initial, `1`=1st suffix, `2`=2nd suffix... |
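
Because a leading space marks a word boundary, detokenization is a plain string join. A minimal sketch using hand-written token dicts in the format above (special marker tokens such as `<uppercase_word>` are not handled here):

```python
def detokenize(tokens):
    """Concatenate token strings; leading spaces restore word boundaries."""
    return "".join(t["token"] for t in tokens).lstrip()

# Hand-written token dicts for "kitaplarda" ("in the books")
tokens = [
    {"token": " kitap", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "lar",    "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "da",     "token_type": "SUFFIX", "morph_pos": 2},
]
print(detokenize(tokens))  # kitaplarda
```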

### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected by TDK lookup |
| `_caps` | `True` — originally ALL CAPS word |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |

---

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
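
Fix 9 (allomorph canonicalization) can be illustrated with a tiny lookup table. This is a hand-rolled sketch, not the library's actual mapping or ID scheme; real Turkish suffix selection also involves vowel harmony and consonant assimilation:

```python
# Map surface allomorphs of the same morpheme to one canonical key,
# so that e.g. "lar" and "ler" share a single vocabulary ID.
ALLOMORPHS = {
    "lar": "PL",  "ler": "PL",                               # plural
    "da": "LOC",  "de": "LOC",  "ta": "LOC",  "te": "LOC",   # locative
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",  # ablative
}

def hash_id(key: str) -> int:
    # Toy hash-stable ID (the real tokenizer's scheme is unspecified here).
    return sum(ord(c) * 31**i for i, c in enumerate(key)) % 10_000

def canonical_id(suffix: str) -> int:
    """Stable ID derived from the canonical morpheme, not the surface form."""
    return hash_id(ALLOMORPHS.get(suffix, suffix))

print(canonical_id("lar") == canonical_id("ler"))  # True
print(canonical_id("da") == canonical_id("te"))    # True
```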

---

## License

MIT © [Ethosoft](https://huggingface.co/Ethosoft)