---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

**Turkish morphological tokenizer with a TR-MMLU world record of 92.64%**

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).

## Model Details

| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | zemberek-python |

---

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

---

### With Transformers (`AutoTokenizer`)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Ethosoft/NedoTurkishTokenizer", trust_remote_code=True
)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])       # hash-stable int IDs
print(out["attention_mask"])  # [1, 1, 1, ...]
print(out["token_type_ids"])  # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

**Batch tokenization:**

```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```

**Direct morphological tokenization:**

```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):
        print(f" [{t['_canonical']}]", end="")
    if t.get("_compound"):
        print(f" compound={t['_parts']}", end="")
    if t.get("_expansion"):
        print(f" -> {t['_expansion']}", end="")
    print()
```

---

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel, all CPUs)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
```

---

### Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `` | ROOT | 0 | ALL CAPS marker (Fix 1) |
| ` istanbul` | ROOT | 0 | Lowercased |
| `'` | PUNCT | 0 | Fixed boundary |
| `da` | SUFFIX | 1 | `-LOC` [LOC] |
| ` meeting` | FOREIGN | 0 | TDK lookup (Fix 7) |
| `e` | SUFFIX | 1 | `-DAT` [DAT] |
| ` katılmak` | ROOT | 0 | Root corrected (Fix 4) |
| `lama` | SUFFIX | 1 | `-VN+NEG` |
| `d` | SUFFIX | 2 | `-PAST` |
| `ım` | SUFFIX | 3 | `-1SG` [1SG] |

---

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string; a leading space means word-initial |
| `token_type` | `str` | Morphological type (`ROOT`, `SUFFIX`, `FOREIGN`, `PUNCT`, ...) |
| `morph_pos` | `int` | Position within the word: `0` = root/initial, `1` = first suffix, `2` = second suffix, ... |

### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"`/`"ler"` -> `"PL"`, `"dan"`/`"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` if the root was detected as foreign via TDK lookup |
| `_caps` | `True` if the word was originally ALL CAPS |
| `_domain` | `True` if the word is from the medical, sports, or tourism domain vocabulary |
| `_compound` | `True` if the word is a compound (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`, ... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when parsed as a verb) |
| `_disambiguated` | `True` if context disambiguation was applied (`"yüz"`, `"gelir"`, ...) |
| `_root_corrected` | `True` if a phonetic root correction was applied: `"gök"` -> `"göğüs"` |

---

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at the apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified as SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation and correction |
| 5 | Punctuation counted as BPE | Classified as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ word medical, sports, and tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN root |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` get different IDs for the same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`, ...) |
| 10 | `başbakan` as a single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = "hundred", "face", or "swim"? | Zemberek sentence-level context disambiguation |

---

## License

MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)
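---

## Appendix: Sketch of Allomorph Canonicalization (Fix 9)

Fix 9 above can be sketched in plain Python. The `CANONICAL` mapping and the `canonicalize` helper below are illustrative assumptions built only from the pairs documented in this card (`lar`/`ler` -> `PL`, `dan`/`den` -> `ABL`, `da` -> `LOC`, `e` -> `DAT`); they are not the tokenizer's actual table or API.

```python
# Illustrative subset of an allomorph table: Turkish vowel harmony
# produces several surface forms of the same morpheme, and Fix 9
# maps them all to one canonical label so they share a single ID.
CANONICAL = {
    "lar": "PL", "ler": "PL",        # plural
    "dan": "ABL", "den": "ABL",      # ablative
    "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC",        # locative
    "ta": "LOC", "te": "LOC",
    "a": "DAT", "e": "DAT",          # dative
}


def canonicalize(token: str, token_type: str) -> str:
    """Return the canonical morpheme label for a SUFFIX token.

    Non-suffix tokens, and suffixes outside the table, pass
    through unchanged.
    """
    if token_type != "SUFFIX":
        return token
    return CANONICAL.get(token.strip(), token)


print(canonicalize("lar", "SUFFIX"))    # PL
print(canonicalize("ler", "SUFFIX"))    # PL
print(canonicalize("den", "SUFFIX"))    # ABL
print(canonicalize("kitap", "ROOT"))    # kitap (unchanged)
```

Collapsing vowel-harmony variants into one canonical ID is what lets `lar` and `ler` share a single embedding row instead of being learned as unrelated tokens.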