---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92.64%**

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).

## Model Details

| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | Zemberek NLP (bundled) |

---

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

> **Java is required** for Zemberek morphological analysis.
> If you get a Java error, install it first:
>
> | OS | Command |
> |---|---|
> | Ubuntu / Debian | `sudo apt install default-jre` |
> | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
> | macOS | `brew install openjdk` |
> | Windows | `winget install Microsoft.OpenJDK.21` |

---

### With Transformers (`AutoTokenizer`)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])       # hash-stable int IDs
print(out["attention_mask"])  # [1, 1, 1, ...]
print(out["token_type_ids"])  # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```

**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):
        print(f" [{t['_canonical']}]", end="")
    if t.get("_compound"):
        print(f" compound={t['_parts']}", end="")
    if t.get("_expansion"):
        print(f" -> {t['_expansion']}", end="")
    print()
```

---

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel across CPU workers)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# Turkish-coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
```

---

### Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL-CAPS marker |
| ` istanbul` | ROOT | 0 | lowercased |
| `da` | SUFFIX | 1 | `-LOC` |
| ` meeting` | FOREIGN | 0 | not in TDK |
| `e` | SUFFIX | 1 | `-DAT` |
| ` katılama` | ROOT | 0 | Zemberek-validated |
| `dı` | SUFFIX | 1 | `-PST` `[PAST]` |
| `m` | SUFFIX | 2 | `-1SG` |

---

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — a leading space marks a word-initial token |
| `token_type` | `str` | Morphological type (see table below) |
| `morph_pos` | `int` | Position within the word: `0`=root, `1`=1st suffix, `2`=2nd suffix, ... |

### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |
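
Since every token carries a `token_type`, computing a type distribution or filtering to lexical material is a one-liner. A small sketch over hand-written token dicts in the documented format (not live tokenizer output):

```python
from collections import Counter

# Hand-written token dicts in the documented format (not live tokenizer output).
tokens = [
    {"token": " kitap", "token_type": "ROOT", "morph_pos": 0},
    {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " %85", "token_type": "NUM", "morph_pos": 0},
    {"token": ".", "token_type": "PUNCT", "morph_pos": 0},
]

# Distribution of token types
counts = Counter(t["token_type"] for t in tokens)
print(counts)

# Keep only lexical material (roots, suffixes, foreign roots)
LEXICAL = {"ROOT", "SUFFIX", "FOREIGN"}
lexical = [t for t in tokens if t["token_type"] in LEXICAL]
print([t["token"] for t in lexical])  # [' kitap', 'lar']
```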

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"`/`"ler"` -> `"PL"`, `"dan"`/`"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected via TDK lookup |
| `_caps` | `True` — word was originally ALL CAPS |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`, ... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`, ...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |

---

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization; restore with an `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix; split at the apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified as SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation and correction |
| 5 | Punctuation counted as BPE | Classified as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ word medical, sports, and tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN root |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` get different IDs for the same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`, ...) |
| 10 | `başbakan` as a single unknown ROOT | Compound-word decomposition |
| 11 | `CMV`, `NATO` carry no meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = "hundred", "face", or "swim"? | Zemberek sentence-level context disambiguation |

---

## License

MIT © [Ethosoft](https://huggingface.co/Ethosoft)