---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---

# TurkTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92%**

TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.

## Installation

```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```

**Java is required** (for Zemberek morphological analysis):

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |

## Quick Start

```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")

for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Output:

```
istanbul ROOT 0
da SUFFIX 1
meeting FOREIGN 0
e SUFFIX 1
katılama ROOT 0
dı SUFFIX 1
m SUFFIX 2
```

## Output Fields

Each token is a dict with the following guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |

### Token Types

| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if from the medical/sports/tourism domain vocabulary |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if the root was corrected by Zemberek |

## Batch Tokenization

```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```

## Statistics

```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```

## Morphological Fixes Applied

1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX reclassification** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (e.g. `"gök"` → `"göğüs"`)
5. **Punctuation** — classified as `PUNCT` (counted toward TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as a reference list
8. **Special-token normalization** — `NUM`, `DATE`, `URL`, `MENTION`, `HASHTAG`, `EMOJI`
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — sentence-level POS selection via Zemberek

## Benchmark

| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |

## License

MIT
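
As a worked example of consuming the token dicts documented above, the sketch below computes a Turkish-coverage percentage from a token list shaped like the Quick Start output. It does not depend on the library; the list is hand-written, and the exact formula inside `tok.stats` is an assumption — here coverage is simply the share of tokens typed as native Turkish units (`ROOT`/`SUFFIX`).

```python
# Token list shaped like the Quick Start output (guaranteed fields only).
tokens = [
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katılama", "token_type": "ROOT", "morph_pos": 0},
    {"token": "dı", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "m", "token_type": "SUFFIX", "morph_pos": 2},
]

# Assumed coverage metric: fraction of tokens that are native Turkish
# morphological units rather than FOREIGN/BPE fallbacks.
TURKISH_TYPES = {"ROOT", "SUFFIX"}

def tr_coverage(tokens):
    native = sum(1 for t in tokens if t["token_type"] in TURKISH_TYPES)
    return round(100 * native / len(tokens), 1)

print(f"TR coverage: {tr_coverage(tokens)}%")  # 6 of 7 tokens are native → 85.7%
```

The same loop generalizes to any per-type breakdown (e.g. counting `FOREIGN` tokens to spot code-switched text), since every token carries `token_type` as a guaranteed field.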