---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---

# TurkTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92%**

TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.

## Installation

```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```

**Java is required** (for Zemberek morphological analysis):

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |

## Quick Start

```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")

for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Output:
```
<uppercase_word> ROOT 0
istanbul ROOT 0
da SUFFIX 1
meeting FOREIGN 0
e SUFFIX 1
katılama ROOT 0
dı SUFFIX 1
m SUFFIX 2
```

## Output Fields

Each token is a dict with the following guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
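
Because `morph_pos` marks where each token sits inside its word, surface words can be rebuilt by simple concatenation. The sketch below runs over a hand-built token list copied from the Quick Start output (not a live `TurkTokenizer` call), skipping the `<uppercase_word>` marker token:

```python
# Hand-built token dicts mirroring the Quick Start output; only the
# three guaranteed fields are used here.
tokens = [
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katılama", "token_type": "ROOT", "morph_pos": 0},
    {"token": "dı", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "m", "token_type": "SUFFIX", "morph_pos": 2},
]

# morph_pos == 0 starts a new word; higher values attach to the word so far.
words = []
for t in tokens:
    if t["morph_pos"] == 0:
        words.append(t["token"])
    else:
        words[-1] += t["token"]

print(words)  # ['istanbulda', 'meetinge', 'katılamadım']
```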

### Token Types

| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
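
Since the underscore-prefixed fields are optional, reading them with `dict.get()` avoids a `KeyError` on tokens that lack them. The token below is a hand-written example, not live output:

```python
# A hand-written SUFFIX token carrying two of the optional fields.
token = {
    "token": "lar",
    "token_type": "SUFFIX",
    "morph_pos": 1,
    "_canonical": "PL",
    "_suffix_label": "-PL",
}

canonical = token.get("_canonical")        # "PL" here; None when absent
is_foreign = token.get("_foreign", False)  # absent, so the default False

print(canonical, is_foreign)  # PL False
```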

## Batch Tokenization

```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```

## Statistics

```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
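
As a rough mental model, TR coverage can be thought of as the share of tokens whose type counts as Turkish. The helper below is a hand-rolled sketch, not the library's `stats()`; treating `ROOT`, `SUFFIX`, and `PUNCT` as Turkish follows the punctuation rule in the fixes list below, and the real implementation may weigh things differently:

```python
# Hypothetical TR-coverage sketch: percentage of tokens whose type is
# treated as Turkish. The TR_TYPES set is an assumption, not the library's.
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT"}

def tr_pct(tokens):
    if not tokens:
        return 0.0
    turkish = sum(1 for t in tokens if t["token_type"] in TR_TYPES)
    return round(100 * turkish / len(tokens), 1)

sample = [
    {"token": "istanbul", "token_type": "ROOT"},
    {"token": "da", "token_type": "SUFFIX"},
    {"token": "meeting", "token_type": "FOREIGN"},
    {"token": "e", "token_type": "SUFFIX"},
]
print(tr_pct(sample))  # 75.0
```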

## Morphological Fixes Applied

1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
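
Fix 9 can be pictured as a lookup from harmony-driven surface forms to a single canonical morpheme ID. The table and helper below are a hand-made illustration (the `a`/`e` → `DAT` and `tan`/`ten` → `ABL` rows are assumptions beyond the examples above), not the tokenizer's internal mapping:

```python
# Illustrative allomorph table: vowel- and consonant-harmony variants
# collapse to one canonical morpheme ID.
CANONICAL = {
    "lar": "PL", "ler": "PL",
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",
    "a": "DAT", "e": "DAT",
}

def canonicalize(suffix: str) -> str:
    # Fall back to the surface form when no canonical ID is known.
    return CANONICAL.get(suffix, suffix)

print(canonicalize("ler"), canonicalize("dan"), canonicalize("sa"))
# → PL ABL sa
```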

## Benchmark

| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |

## License

MIT
|